NEC’s Early Defect Prediction
Presentation 22.3 in the Digital Circuit Innovations track came from researchers at NEC, who demonstrated a method for detecting impending failures in integrated circuits. The motivation for this technique is twofold. First, in the long term the semiconductor market will grow fastest in the embedded world; companies have identified medical and automotive applications as key targets. Neither of these industries tolerates failure well. If a PC component dies, it’s really not a big deal. However, if a brake controller or an implanted insulin monitor breaks, or worse yet, experiences silent data corruption, someone could easily die. Second, as manufacturing moves to finer and finer geometries, the error rate increases exponentially. The end result is that designers must begin to actively plan for failures in the field, and figure out how to continue operation when they happen.
Figure 4 – Defect Prediction Flip Flop
NEC uses what they call a ‘defect prediction flip flop’ (DPFF) to sense the total path delay in a small logic block. Failures that develop gradually over time manifest as increasing path delays, until the delay eventually exceeds the cycle time. Detecting such a pattern is relatively simple: if the path delay exceeds a threshold value for several consecutive cycles, then a failure is likely to occur. Requiring several consecutive cycles ensures that a transient event will not trigger a false warning. As an example, NEC manufactured a 330MHz test chip with a pseudo-defect circuit. The cycle time is 3.03ns, and the ‘warning’ band was set at 95ps, giving a threshold of ~2.93ns.
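The detection rule described above can be sketched in software as a behavioural model. This is only an illustration and not NEC's circuit: the class name DelayWarningMonitor, the choice of 3 consecutive cycles (the presentation summary only says "several"), and the delay trace are all assumptions. Only the 330MHz clock and the 95ps warning band come from the text.

```python
from dataclasses import dataclass


def cycle_time_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


@dataclass
class DelayWarningMonitor:
    """Hypothetical behavioural model of a consecutive-cycle delay warning."""
    threshold_ns: float
    required_cycles: int  # assumed value; the article only says "several"
    streak: int = 0       # consecutive cycles seen above the threshold

    def observe(self, path_delay_ns: float) -> bool:
        """Record one cycle's path delay; return True once a warning should fire."""
        if path_delay_ns > self.threshold_ns:
            self.streak += 1
        else:
            self.streak = 0  # an isolated slow cycle is treated as transient
        return self.streak >= self.required_cycles


# Figures from the article: 330MHz clock, 95ps warning band.
CYCLE_NS = cycle_time_ns(330.0)   # roughly 3.03 ns
THRESHOLD_NS = CYCLE_NS - 0.095   # roughly 2.93 ns

monitor = DelayWarningMonitor(threshold_ns=THRESHOLD_NS, required_cycles=3)
# Made-up delay trace in ns: one transient spike, then sustained degradation.
trace = [2.80, 2.95, 2.80, 2.95, 2.96, 2.97]
warnings = [monitor.observe(delay) for delay in trace]
```

In this sketch the single slow cycle early in the trace resets without raising a warning, while the three slow cycles at the end do raise one. A real DPFF makes this decision in hardware by sampling the data signal inside the warning band rather than measuring delay as a number.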
The DPFF is used in conjunction with fine-grained redundancy. The entire logic portion of a chip can be broken up into small regions, with a DPFF between adjacent regions. When a failure is likely, the DPFF switches off the main logic block and has the backup take over; this sort of ‘logic failover’ prevents any errors from occurring. A fatal error can only occur if both a logic block and its redundant copy are hit by defects. However, since the logic can be divided into very small blocks, this is unlikely to happen (imagine redundancy at the functional unit level, versus at the core level). According to NEC’s experimental results, their early defect prediction with fine-grained redundancy is superior to 4-way redundancy without prediction, but uses only 2.5x the area of a normal design. After 2 defects, 81% of the NEC test chips were still functional, and 59% were functional after 5 defects, compared to 33% and 1% respectively for a 3-way redundant architecture.
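The failover behaviour can be illustrated with a small model. This is a sketch under simplifying assumptions, not NEC's design: each block has exactly one spare, defects are assumed to hit only the copy currently in use, and the names RedundantBlock and chip_functional are hypothetical.

```python
class RedundantBlock:
    """Hypothetical model of one small logic region with a single spare copy."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.active = "primary"  # "primary", "backup", or None once both are lost

    def predicted_failure(self) -> None:
        """A warning on the active copy: fail over if a spare is still available."""
        if self.active == "primary":
            self.active = "backup"
        else:
            self.active = None  # the spare was already in use (or already lost)

    @property
    def functional(self) -> bool:
        return self.active is not None


def chip_functional(blocks: list) -> bool:
    """The chip keeps working only while every block has a usable copy."""
    return all(block.functional for block in blocks)


# Two defects in different blocks: each block fails over, the chip survives.
chip = [RedundantBlock("alu"), RedundantBlock("shifter")]
chip[0].predicted_failure()
chip[1].predicted_failure()
survives_spread_defects = chip_functional(chip)

# A further defect in a block that already failed over is fatal.
chip[0].predicted_failure()
survives_repeat_defect = chip_functional(chip)
```

The model shows why granularity matters: the more blocks the logic is divided into, the less likely it is that a second defect lands in a block whose spare is already in use.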