32nm CPU Logic
Generally speaking, the combinational logic that performs calculations inside modern chips scales relatively well with voltage. Variability is a significant concern, but it typically manifests as slower operation rather than incorrect results. To address variability and preserve high performance, several changes in logic design techniques were necessary.
The first set of changes focused on logical design and eliminating highly loaded logic circuits that are more vulnerable to variability. Intel’s simulation results shown in Figure 1 indicate that large fan-in and fan-out configurations significantly worsen variability. Restricting the logic libraries to 3 output stacks and 3 inputs for multiplexers improved low voltage performance by cutting the delay in half. Of course, these design rules tend to increase the transistor count and area for functions that were previously implemented with more heavily loaded circuits; the area impact was estimated as 10%.
A second set of changes emphasized the importance of physical design, and dealt with variability at the transistor level rather than the gate level. The first set of simulation results in Figure 2 shows the impact of device threshold on variation, with substantially worse problems for high Vt devices. Moving to nominal Vt substantially reduces delay and improves performance at low voltage, but also increases leakage power. The second set of simulations shows variability problems for minimum sized transistors when operating at 0.5V. Minimum sized devices increase delay by about a factor of two, so requiring larger devices significantly reduces variability issues. The downside is that larger transistors use more area, although this only impacts logic and the cost is around 5%.
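To see why a higher threshold makes variation so much worse at 0.5V, here is a minimal sketch using the textbook alpha-power-law delay model, where gate delay is proportional to Vdd / (Vdd - Vt)^alpha. This is not Intel's simulation methodology. The threshold voltages, the size of the Vt shift, and the value of alpha are all hypothetical numbers chosen for illustration, and the model itself is only a rough approximation this close to threshold.

```python
def relative_delay(vdd, vt, alpha=1.5):
    """Alpha-power-law gate delay, up to a constant factor."""
    if vdd <= vt:
        raise ValueError("model only applies when vdd is above vt")
    return vdd / (vdd - vt) ** alpha


def delay_spread(vdd, vt, dvt, alpha=1.5):
    """Delay of a slow device (vt + dvt) relative to a typical device (vt)."""
    return relative_delay(vdd, vt + dvt, alpha) / relative_delay(vdd, vt, alpha)


VDD = 0.5          # supply voltage from the article's simulations
NOMINAL_VT = 0.30  # hypothetical nominal threshold, in volts
HIGH_VT = 0.40     # hypothetical high threshold, in volts
DVT = 0.03         # hypothetical threshold shift caused by variation

nominal_spread = delay_spread(VDD, NOMINAL_VT, DVT)
high_spread = delay_spread(VDD, HIGH_VT, DVT)

print(f"nominal Vt: slow device is {nominal_spread:.2f}x slower")
print(f"high Vt:    slow device is {high_spread:.2f}x slower")
```

Under these assumed numbers, the same threshold shift costs the high Vt device far more delay, because its gate overdrive (Vdd - Vt) is much smaller to begin with. The same reasoning suggests why the penalty shrinks as the supply voltage rises.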
32nm CPU Memory
In contrast to logic, memory elements like latches, registers, and SRAM are far more difficult to operate at low voltages and are commonly the limiting factor in modern designs. Whereas logic tends to slow down and rarely produces incorrect results, memories are extremely sensitive to voltage and are very likely to produce incorrect data. Conveniently, larger caches can easily be placed on a separate voltage plane from the logic in a CPU or GPU core. However, the logic circuits in a core, such as floating point units, still incorporate a variety of state elements such as latches and register files.
As many readers know, Intel already uses 8T cells for low voltage SRAMs. This practice started with the L1 data cache on the 45nm Silverthorne and Nehalem. To operate at NTV, further efforts are necessary to reduce the read and write voltages. Intel’s engineers use a 10T cell, which can write data at a voltage 0.25V lower than an 8T cell. The read circuits are programmable and can shift the voltage to compensate for variation, ensuring robust reads at low voltage. The area impact is around 20% compared to 8T cells.
The other significant type of memory is the flip-flop, which holds data in logic circuits. Even a simple adder requires a little storage to hold the input and output operands before they can be safely stored in the registers or memory. Intel’s team also described several techniques to improve the stability of flip-flops at low voltage. Altogether, the NTV flip-flops are about 35% larger than normal.
NTV Pentium Results
The 32nm NTV Pentium core is 2mm² and uses 6 million transistors. The processor is split into two voltage domains, one for logic and another for caches. The caches can safely operate down to 0.55V, while the logic can reach 0.28V; the maximum voltage is 1.2V. The entire chip is synthesized for 0.5V operation to ensure good clock frequencies at low voltage. Targeting a higher voltage would substantially reduce near-threshold performance, while only saving a little leakage. Additionally, the floating point unit is power gated based on instruction level activity, with a single cycle wake-up.
Figure 3 shows measured frequency and power as a function of logic and cache voltage, scaling from 2mW at 3MHz up to 737mW at 915MHz. At the lowest voltage levels and 3MHz, the power consumption is dominated by leakage from the caches (62%). At the near-threshold voltage levels and 80MHz, logic activity accounts for 53% of power, while leakage is 42%. At the maximum voltage, power is dominated by logic activity (81%). The most energy-efficient operating point is 0.45V for logic and 0.55V for memory, achieving 4.7× energy savings.
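The two measured endpoints can be converted into energy per clock cycle, which helps explain why the best operating point sits between the extremes rather than at the lowest voltage. The sketch below uses only the power and frequency figures quoted above; the function name is made up for illustration. The power at the 80MHz near-threshold point is not given in the text, so it is left out rather than guessed.

```python
def energy_per_cycle_nj(power_mw, freq_mhz):
    """Energy per clock cycle in nanojoules (mW divided by MHz gives nJ)."""
    return power_mw / freq_mhz


# Measured endpoints quoted in the article.
e_min_voltage = energy_per_cycle_nj(power_mw=2.0, freq_mhz=3.0)
e_max_voltage = energy_per_cycle_nj(power_mw=737.0, freq_mhz=915.0)

print(f"lowest voltage:  {e_min_voltage:.3f} nJ/cycle")   # 0.667
print(f"highest voltage: {e_max_voltage:.3f} nJ/cycle")   # 0.805
print(f"ratio: {e_max_voltage / e_min_voltage:.2f}x")     # 1.21
```

Despite a 368× drop in power, the lowest voltage point saves only about 20% in energy per cycle compared to the maximum voltage, because each cycle takes so long that leakage accumulates. That is consistent with the most efficient point lying at a near-threshold voltage instead of at the minimum.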