On an average day, each of us comes into implicit contact with a large number of electronic systems, often without being aware of it. This holds almost regardless of job or lifestyle and applies (with variations) to virtually every individual in a highly developed society. Consider, for example, driving a car to work in the morning. We require our car to be safe, i.e. we expect it to take actions that avoid an accident or, if an accident does occur, to keep the negative impact on our health to a minimum. These and many other features are provided by around 100 electronic systems in a modern car. Another everyday example is healthcare. A visit to a doctor may involve an X-ray or other scans. A system might then automatically analyze the scans for potential disease patterns, thereby aiding the doctor in reaching a diagnosis. One might also carry hearing aids or life-sustaining devices such as pacemakers. A further example is the cell phone that we carry with us and probably use several times a day to conduct business, to talk to our families, etc. What we typically do not think about when we make a call is that, for a single second of talking time, hundreds of millions of computational operations are executed on the cell phone and in the backbone infrastructure, which is among the most complex technical systems in existence. Again, a large number of electronic systems (most of them invisible to us and not even part of the cell phone) make all this possible.
But what happens if these systems do not work dependably, if they fail temporarily or completely, or if they simply do not function as specified? If, for example, the ABS (Anti-lock Braking System) of a car responds a few milliseconds late? If a pacemaker sets the pace at the wrong frequency? If a connection is dropped while someone is attempting to make an emergency call from a cell phone?
We rely on all of this working as expected. And indeed, in many cases the trust placed in the dependability of these systems is justified. More than 40 years of research and development have led to a silicon process (the technology that makes it possible to integrate many transistor functions on an integrated circuit, which is the core part of an electronic system) that is not only dependable but also supports high complexity, and thus allows all necessary computations to be performed in time, at affordable prices, etc.
However, there is a major new problem: the electronic systems discussed above will become inherently undependable in the near future as they migrate to new technologies!
Put simply, the reason is that the feature sizes of the basic switching devices on integrated circuits are becoming so small (only several tens of nanometers and less) that the fabrication process can no longer be entirely controlled, which results in faulty switching devices. Even after fabrication, the basic switching devices are much more susceptible to the conditions of the surrounding environment, such as heat and exposure to cosmic rays, and they increasingly tend to 'age' (i.e. their electrical behavior changes over time).
The International Technology Roadmap for Semiconductors [ITRS] lists five challenges under "Overall Design Technology Challenges": "Design Productivity", "Power Consumption", "Reliability", "Interference" and "Manufacturability". This SPP proposal addresses four of these major challenges.
There is no technology in sight that might solve the inherent undependability problem at the level of the basic switching devices. This SPP proposal is fully in line with the latest version of the ENIAC Strategic Research Agenda, which states in its edition of November 2007: "Emerging devices are expected to be more defective, less reliable and less controlled in both their position and physical properties. It is therefore important to go beyond simply developing fault-tolerant systems that monitor the device at run-time and react to error detection. It will be necessary to consider error as a specific design constraint and to develop methodologies for error resiliency, accepting that error is inevitable and trading off error rate against performance (e.g. speed, power consumption) in an application-dependent manner" [SRA07].
One possible way to approach this major problem is to accept the non-dependability at the level of the basic switching devices while making sure that it does not propagate to the user of the respective system. The electronic system in a future ABS of a car might contain basic switching devices (e.g. transistors, carbon nanotubes, etc.) that are temporarily or permanently faulty, but the ABS should still react on time and work in a dependable manner.
The crux is a paradigm shift: building dependable systems from non-dependable basic switching devices (see also Shekhar Borkar, Intel, at the DATE'07 Conference, Nice [Bor07a]).
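To make the idea tangible, the following is a minimal illustrative sketch and not a method proposed here. It uses the classic technique of majority voting over redundant copies (triple modular redundancy) as one well-known instance of the principle. The fault model (an independent output flip with probability `fault_rate`), the fault-free voter, and all function names are assumptions made purely for illustration.

```python
import random


def unreliable_and(a: bool, b: bool, fault_rate: float, rng: random.Random) -> bool:
    """AND gate whose output is flipped with probability fault_rate.

    Hypothetical fault model: each evaluation fails independently.
    """
    result = a and b
    if rng.random() < fault_rate:
        return not result
    return result


def voted_and(a: bool, b: bool, fault_rate: float, rng: random.Random,
              copies: int = 3) -> bool:
    """Evaluate several unreliable gates and return the majority result.

    Assumption: the voter itself is fault-free, which a real design
    could not take for granted.
    """
    votes = sum(unreliable_and(a, b, fault_rate, rng) for _ in range(copies))
    return votes * 2 > copies


def error_rate(gate, trials: int, fault_rate: float, seed: int) -> float:
    """Estimate how often `gate` disagrees with an ideal AND gate."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        a = rng.random() < 0.5
        b = rng.random() < 0.5
        if gate(a, b, fault_rate, rng) != (a and b):
            errors += 1
    return errors / trials


if __name__ == "__main__":
    p = 0.05
    print("single gate error rate:", error_rate(unreliable_and, 20000, p, seed=1))
    print("voted gate error rate: ", error_rate(voted_and, 20000, p, seed=1))
```

Under these assumptions, a three-way vote fails only when at least two copies fail, i.e. with probability 3p²(1−p) + p³ instead of p. The price is roughly three times the area and power, which is precisely the kind of trade-off between error rate and performance that the cited agenda calls for handling in an application-dependent manner.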
So, what does it take? So far, the implicit assumption has primarily been (with some exceptions) that these systems and their basic switching devices work in a dependable manner. Circuit design techniques, computer architectures, operating systems, application software, etc. are all implicitly based upon this assumption. But, again, it no longer holds. Leading researchers around the world are currently drawing attention to this problem [see: Jan Rabaey, UC Berkeley, Keynote Talk at SOCC'06, Austin]. Hence, almost everything from the physics of the circuits up to the application software needs to be rethought from the ground up: computer architectures might need to change, as might application software design, operating systems, design methodologies, etc. Dependability will become a major design constraint, just as 'low power design' did years ago, which led to new design approaches and architectures such as multi-core systems.