Mankind Still Has a Part to Play

Christof Fetzer

In some circumstances human life can depend on what are known as “critical systems.” Can you give us a few examples of these systems?

Fetzer: The correct functioning of critical systems can be important for human safety (safety critical) or for the survival of a company and the prevention of financial losses (mission critical). Use of these kinds of systems is increasing all the time. One can find examples in industries such as aerospace, railway, power, medical and chemical. Often, mechanical systems are being replaced by computer-controlled electromechanical systems. One example from the automotive domain is “brake-by-wire” technology, a mechatronic brake. The term describes how the braking signal reaches the brake: electronically, over a wire. The brake motors are controlled by an embedded computer and activated via an electronic brake pedal. This technology is also used, for example, in high-speed trains.

What type of errors can cause computers to crash in these systems?

Fetzer: Possible errors include hardware, software, user, security and network errors. The problem in this case is that there are usually too many causes to deal with them all correctly. For example, the value of program variables could be corrupted by both software and hardware errors. In fact, the routines for error recognition and error handling are themselves often faulty. They are executed relatively infrequently and are for the most part difficult to test. Furthermore, they rarely cover all possible error scenarios.

How did you come up with the idea for the project?

Fetzer: People usually assume that software is executed correctly by the hardware, which means they focus on validating the software. Although it is true that the reliability of processors in computer systems has improved dramatically in recent decades, statistics show that hardware faults in large systems still occur more often than software errors. The commercial value of our software is that it would allow even low-cost hardware, also known as commodity hardware, to be installed in critical systems.

How do “dependable systems” differ from these critical systems?

Fetzer: Dependable systems – also known as “trustworthy systems” – are those in which the probability of system failure is negligible. This means that errors have to be detected and masked – component failures must not be allowed to cause a system failure. Therefore, dependable systems need to have a fault-tolerant design. They often use failure virtualization to simplify fault handling. Virtualization converts a failure that is difficult to handle into a simpler one. For example, the program could convert all hardware faults, such as “bit flips” or incorrect results, into a crash failure. As a result, a program developer only has to focus on crash failures of individual processes, instead of having to check whether all the results are correct.
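
To illustrate the idea of failure virtualization, here is a minimal Python sketch. It is not taken from Fetzer's software: the names `CrashFailure` and `run_or_crash` are hypothetical, and running a computation twice and comparing the results is just one simple detection technique assumed here for illustration. The point is only that a silent wrong result is turned into an explicit crash, which is the single failure type the caller has to handle.

```python
class CrashFailure(Exception):
    """Hypothetical exception standing in for a process crash."""


def run_or_crash(compute, *args):
    """Run `compute` twice and crash if the two results disagree.

    Assumption for this sketch: a transient hardware fault corrupts at
    most one of the two runs, so a mismatch reveals it.
    """
    first = compute(*args)
    second = compute(*args)
    if first != second:
        # A hard-to-handle failure (a silently wrong value) becomes a
        # simple one (a crash) that the caller can deal with uniformly.
        raise CrashFailure("redundant executions disagree")
    return first


def double(x):
    return 2 * x


if __name__ == "__main__":
    try:
        print(run_or_crash(double, 21))
    except CrashFailure:
        print("computation terminated; caller may re-execute it")
```

The caller never inspects individual results for plausibility. It either receives a value that passed the check or sees a crash.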

What do you want to achieve with the software you’re currently working on?

Fetzer: The software implements a failure virtualization layer. The goal is simple – we want to guarantee that a program will always be executed correctly. Even if the hardware has a design, production or runtime error, a program must either be executed correctly or terminated. In this context, the probability of incorrect output is negligible. Another advantage of virtualization is that it enables critical programs to be executed safely on “commodity hardware,” thus lowering the hardware costs.

What is the concept behind your software?

Fetzer: We perform software encoded processing. Programs are executed in an encoded fashion, and each incorrect execution results in an invalid code. We define a bound on the probability that an error goes undetected. This bound is chosen to be so small that system failures caused by incorrect execution are negligible.
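
The interview does not say which code the software uses. As a hedged illustration of encoded processing in general, the following Python sketch uses an AN code, a well-known arithmetic code in which a value x is stored as A * x. The constant `A` and all function names are assumptions chosen for this example. A value is a valid code word only if it is a multiple of `A`, and addition can be carried out directly on encoded values.

```python
A = 58659  # hypothetical odd constant chosen only for this sketch


class EncodingError(Exception):
    """Raised when a value is not a valid code word."""


def encode(x):
    return A * x


def decode(xc):
    # A valid code word is always a multiple of A.
    if xc % A != 0:
        raise EncodingError("invalid code word: execution was corrupted")
    return xc // A


def add_encoded(xc, yc):
    # A*x + A*y == A*(x + y), so the sum is again a valid code word.
    return xc + yc


def flip_bit(value, bit):
    # Simulates a hardware "bit flip" on a non-negative integer.
    return value ^ (1 << bit)


if __name__ == "__main__":
    total = add_encoded(encode(3), encode(4))
    print(decode(total))
    try:
        decode(flip_bit(total, 5))
    except EncodingError as err:
        print("detected:", err)
```

A single bit flip changes the value by plus or minus a power of two, which is never a multiple of an odd `A` greater than 1, so this sketch detects every single bit flip. Other corruptions can by chance land on another multiple of `A`, which is why the guarantee is a probabilistic one.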

How can your software prevent accidents in practice? Does it cover all eventualities?

Fetzer: We guarantee with a given probability that incorrect output will not be caused by hardware faults. A terminated action can be re-executed either locally or remotely. We cover all hardware failures, which means we can identify all failures without having to make any assumptions about the type of failure that might occur. Software failures also need to be addressed: users must check whether the software works properly and whether the right software is being run. In other work we are therefore also concerned with model checking, a procedure for validating programs (models) in a fully automated process. We are also investigating code injection attacks, which are a way to “inject” code into a computer program or system. This method is usually used for malicious purposes, such as gaining unauthorized access to systems, and it thus reduces the reliability of systems.

What difficulties have you encountered during the development?

Fetzer: It is very difficult to encode general programs. The code for small cyclic programs, i.e. those that execute the same code cyclically and are used, for example, in programmable logic controllers (PLCs), can sometimes still be calculated statically. However, most applications are non-cyclic programs. In this case there are too many possible paths to be able to calculate the code statically. Finding a solution to this problem was the biggest hurdle during the development process, but we overcame it. Previously, encodings were known for only a few operations. Now we have found encodings for all possible operations. For example, even dividing two numbers was not previously possible in encoded form.

Do you already know where you would like to see the software in operation?

Fetzer: We do, of course, have a few key areas in mind where we would like to test and install the system. The automotive industry would be an ideal candidate. However, a range of other critical application areas could also benefit from the development, for example aerospace, chemical industries, railway, power etc. We are still looking for partners.

When do you hope to complete your software?

Fetzer: We have installed a prototype at TU Dresden which is currently being validated using error injection experiments. We have developed a tool that not only injects errors into programs, but also tracks how they are propagated and detected.

In the near future – from around 2012 – the variability of transistors on chips will increase to such an extent that a large number of transistors will not be able to work correctly anymore. This means it will be very difficult to guarantee that programs are executed correctly. Our software would be able to offer such a guarantee, since the probability that it would not detect an incorrect execution is negligible. We hope that our software will be ready for a wider market within the next five years.
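
The error injection experiments mentioned in this answer can be pictured with a small Python sketch. It is not the TU Dresden tool. It only assumes, for illustration, values protected by an AN code (each value x stored as A * x, with a hypothetical constant `A`) and injects one random bit flip per trial, then counts how many corruptions are detected by the divisibility check.

```python
import random

A = 58659  # hypothetical odd constant, assumed only for this sketch


def inject_bit_flip(value, bit):
    # Flip one bit of a non-negative integer to simulate a hardware fault.
    return value ^ (1 << bit)


def run_campaign(trials, seed=0, width=32):
    """Inject one random bit flip per trial; return how many were detected."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    detected = 0
    for _ in range(trials):
        x = rng.randrange(1000)
        corrupted = inject_bit_flip(A * x, rng.randrange(width))
        # A valid code word is a multiple of A; anything else is flagged.
        if corrupted % A != 0:
            detected += 1
    return detected


if __name__ == "__main__":
    trials = 1000
    print(run_campaign(trials), "of", trials, "injected faults detected")
```

Because `A` is odd, every single bit flip is detected in this sketch. A realistic campaign would also inject multi-bit and control-flow faults and, as described in the answer, track how each error propagates through the program.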

Will mankind ever be able to rely 100 percent on technology?

Fetzer: There is no such thing as 100 percent dependability. Therefore, computers should only be installed in critical systems when they increase dependability. In many cases, this means that a human still has to be “in the loop” to monitor these systems.