Simplify Safety-Critical System Design

October 9, 2017

Nicholas Cravotta

As industrial systems become more complex, safety certification is a growing challenge. Building integrated systems that combine safety-critical and non-critical features is particularly difficult.

Developers can tackle the challenge with the Intel® Xeon® processor D-1529 for industrial IEC 61508 certification, the first Intel® processor designed for SIL 2 certification. This functional safety solution provides a tightly integrated package comprising hardware, software, and certification documentation that simplifies safety certification, and it enables developers to replace redundant systems with a single hardware platform – significantly reducing costs.

The Basics of Functional Safety

Before we get to the new processor, let's review the fundamentals of safety-critical design. Designing safe systems begins with risk assessment. Consider the way IEC 61508 categorizes risk based on the likelihood of the event and the consequence if the event occurs (see Figure 1).

Figure 1. IEC 61508 places risk into four categories. (Source: Wikipedia)

To meet the requirements of IEC 61508, or of any safety regulation, developers must assess the risk of each identified hazardous event. They must then design the system to avoid unacceptable risks and put mechanisms in place to handle the events that do occur.
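The likelihood-and-consequence classification described above can be sketched as a simple lookup table. The category names and cell values below follow the example matrix commonly reproduced from IEC 61508 and should be treated as illustrative assumptions; a real project takes its categories and classes from its own hazard and risk analysis.

```python
# Illustrative risk-class lookup. The category names and the cell values are
# assumptions based on the commonly reproduced IEC 61508 example matrix; they
# are not a substitute for a project-specific hazard and risk analysis.

CONSEQUENCES = ["catastrophic", "critical", "marginal", "negligible"]

# One row per likelihood category; columns follow the order of CONSEQUENCES.
RISK_MATRIX = {
    "frequent":   ["I",   "I",   "I",   "II"],
    "probable":   ["I",   "I",   "II",  "III"],
    "occasional": ["I",   "II",  "III", "III"],
    "remote":     ["II",  "III", "III", "IV"],
    "improbable": ["III", "III", "IV",  "IV"],
    "incredible": ["IV",  "IV",  "IV",  "IV"],
}


def risk_class(likelihood: str, consequence: str) -> str:
    """Return the risk class (I to IV) for one hazardous event."""
    if likelihood not in RISK_MATRIX:
        raise ValueError(f"unknown likelihood: {likelihood!r}")
    if consequence not in CONSEQUENCES:
        raise ValueError(f"unknown consequence: {consequence!r}")
    return RISK_MATRIX[likelihood][CONSEQUENCES.index(consequence)]


def needs_redesign(likelihood: str, consequence: str) -> bool:
    """Class I is treated here as unacceptable, so the design must change."""
    return risk_class(likelihood, consequence) == "I"
```

With these assumed values, a hazard judged "occasional" and "catastrophic" falls into class I and forces a design change, while the same consequence at "incredible" likelihood falls into class IV.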

A First In Certified Processors

To create robust systems, safety must be considered from the outset of design. Thus, safety begins with the hardware architecture.

As Intel's first processor for IEC 61508 certification, the Intel® Xeon® processor D-1529 comes with a variety of features not found on general-purpose chips. These include hardware upgrades along with software, tools, and documentation to accelerate development of IEC 61508 safety integrity level (SIL) 1 and 2 applications.

On the silicon front, the processor integrates extensive hardware diagnostics. These include:

Over-voltage/over-current detection
Processor temperature reporting
Machine check exceptions
PCIe advanced error reporting
Platform controller hub (PCH) error logic
SATA and AHCI diagnostics

Safety features are also integrated into the software architecture supporting the processor. Programmable error exceptions and software-generated exceptions enable developers to ensure the reliable operation of application code.

Supporting tools include the Intel® Software Test Library (Intel® STL). This library makes it easier to enable offline and online software diagnostics, software validation, and fault injection.

Replacing Redundant Systems

A key advantage of the Intel approach is its ability to replace redundant systems with a single platform. As an example of this approach, Laurent Remont, CTO of Kontron, points to safety-certified computers for rail.

The traditional approach is to use redundant processing cards, explains Remont. For example, Kontron offers a 3U VPX system with redundant blades (Channel A and Channel B blades in Figure 2) linked through an Ethernet switch and monitored by a gateway CPU blade. To further guarantee availability, the entire 3U VPX system is then duplicated.

Figure 2. Kontron's rail computer includes multiple layers of redundancy. (Source: Kontron)

With the Intel® Xeon® processor D-1529, developers can shrink Channel A, Channel B, the Ethernet switch, and the gateway blade onto a single blade. Instead of running the redundant code on different blades, developers can now run the code on two different cores of the same processor.

According to Remont, this hardware consolidation relies on the fact that the processor cores have been pre-certified by Intel for their independence and robustness, as well as on the redundancy ensured by the Intel safety library.
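The idea of running redundant code in two channels and cross-checking the results can be sketched generically. This is a minimal illustration and not the Intel safety library: every name here is hypothetical, and both channels run sequentially in one thread, whereas a real deployment would run each channel on its own core and rely on the certified platform's comparison mechanism.

```python
from typing import Callable, Sequence

# A channel takes sensor samples and returns one computed value.
Channel = Callable[[Sequence[float]], float]


class ChannelMismatch(Exception):
    """Raised when the two redundant channels disagree."""


def mean_direct(samples: Sequence[float]) -> float:
    """Hypothetical channel A: mean computed as sum divided by count."""
    return sum(samples) / len(samples)


def mean_incremental(samples: Sequence[float]) -> float:
    """Hypothetical channel B: same quantity, different algorithm."""
    mean = 0.0
    for count, value in enumerate(samples, start=1):
        mean += (value - mean) / count
    return mean


def run_redundant(channel_a: Channel, channel_b: Channel,
                  samples: Sequence[float], tolerance: float = 1e-9) -> float:
    """Run both channels and accept the result only if they agree."""
    result_a = channel_a(samples)
    result_b = channel_b(samples)
    if abs(result_a - result_b) > tolerance:
        raise ChannelMismatch(f"A={result_a!r} B={result_b!r}")
    return result_a


def safe_step(channel_a: Channel, channel_b: Channel,
              samples: Sequence[float], safe_value: float) -> float:
    """Fall back to a predefined safe output when the channels disagree."""
    try:
        return run_redundant(channel_a, channel_b, samples)
    except ChannelMismatch:
        return safe_value
```

Using two different algorithms for the same quantity is a design choice in this sketch: it lets the comparison catch a systematic error in one implementation as well as a transient fault.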

Remont notes that a single-blade solution significantly improves on the traditional architecture in both overall reliability and cost. This is particularly true considering that many traditional systems use a mix of CPU architectures to ensure reliability. By running all workloads on the same architecture instead, developers can speed time to market, reduce total cost of ownership, and accelerate the certification process.

Running Workloads with Mixed Criticality

One of the most important issues complicating safety certification is mixed criticality. Complex systems run a variety of workloads that must meet different levels of safety requirements, depending on how likely each workload is to fail and how severe the impact of a failure would be. Further complicating design is the use of virtualization, which allows a single hardware controller to accommodate workloads for multiple machines.

The introduction of IT/business workloads and IoT connectivity to applications like factory automation adds yet another layer of complexity. Being able to monitor factory equipment remotely can significantly improve operating efficiency by turning real-time data into actionable intelligence.

But smarter manufacturing requires a robust and secure implementation that does not compromise the reliable operation of the system. The workloads that enable these advanced features could be third-party software and may not have been written to meet stringent safety requirements. Consider how a communications library that prioritizes guaranteed delivery could violate critical real-time deadlines as a consequence.
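A small simulation makes the deadline problem concrete. It is a sketch under stated assumptions: the 4 ms cost per send attempt, the 10 ms deadline used in the example, and both function names are hypothetical, and time is simulated rather than measured.

```python
import itertools

ATTEMPT_COST_MS = 4  # hypothetical cost of one send attempt, in milliseconds


def send_guaranteed(failures_before_success: int) -> int:
    """Retry until the message is delivered; return the elapsed time in ms."""
    elapsed = 0
    for attempt in itertools.count():
        elapsed += ATTEMPT_COST_MS
        if attempt >= failures_before_success:
            return elapsed


def send_bounded(failures_before_success: int, deadline_ms: int):
    """Retry only while another attempt still fits inside the deadline.

    Returns (delivered, elapsed_ms).
    """
    elapsed = 0
    attempt = 0
    while elapsed + ATTEMPT_COST_MS <= deadline_ms:
        elapsed += ATTEMPT_COST_MS
        if attempt >= failures_before_success:
            return True, elapsed
        attempt += 1
    return False, elapsed
```

If the link drops the first three attempts, `send_guaranteed(3)` returns 16, overrunning a 10 ms deadline, while `send_bounded(3, 10)` gives up after 8 ms and reports the failure so the caller can react within its time budget.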

To mitigate such risks, systems need to be able to isolate and protect the safe workloads from "non-safe" workloads. In other words, if the user interface (UI) crashes, the failure must be contained to prevent the main system from acting unpredictably. Similarly, if a machine in a virtualized environment goes down, this should not impact the other virtual machines sharing that environment.

Using a Safety-Critical OS

For these and other reasons, safety needs to be an integral part of a system's OS. As such, Wind River supports the Intel® Xeon® processor D-1529 with its safety-focused Linux and VxWorks OSs. These OSs provide a virtualized environment with advanced time and space partitioning capabilities that can support mixed criticality across diverse workloads.
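Time partitioning in general can be pictured as a fixed, repeating schedule in which each partition owns a guaranteed window. The sketch below illustrates that general idea only; it is not the VxWorks or Wind River Linux API, and the partition names and window lengths are hypothetical. Space partitioning, which relies on hardware memory protection, is not modeled here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Window:
    partition: str    # name of the partition that owns this window
    duration_ms: int  # length of the window in milliseconds


# Hypothetical schedule: the safety workload always gets the first 5 ms of
# every 10 ms frame, whatever the other partitions are doing.
SCHEDULE = [
    Window("safety_control", 5),
    Window("hmi", 3),
    Window("cloud_gateway", 2),
]


def major_frame_ms(schedule: Sequence[Window]) -> int:
    """Length of one full pass through the schedule."""
    return sum(window.duration_ms for window in schedule)


def owner_at(schedule: Sequence[Window], t_ms: int) -> str:
    """Return the partition that owns the CPU at time t_ms."""
    frame = major_frame_ms(schedule)
    if frame <= 0:
        raise ValueError("schedule must contain at least one non-empty window")
    offset = t_ms % frame
    for window in schedule:
        if offset < window.duration_ms:
            return window.partition
        offset -= window.duration_ms
    raise AssertionError("unreachable: offset is always inside the frame")
```

Because the schedule is static, a misbehaving partition such as the HMI cannot borrow time from `safety_control`; at most it wastes its own window.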

Wind River also supports the platform through its Simics virtual development environment. This environment allows developers to model the Intel® Xeon® processor D-1529 along with the surrounding system ahead of hardware availability (Figure 3).

Figure 3. Wind River Simics enables whole-system simulation. (Source: Wind River)

Managing Complexity

Designing safe systems is only going to get more complicated as the Internet of Things becomes more pervasive. By considering safety at the outset of design and integrating safety mechanisms from the ground up, OEMs can have confidence their products will meet safety requirements today and tomorrow.