3D Processor Design Results
Latency between certain performance-critical blocks was reduced by judiciously stacking these blocks close together, resulting in higher performance. For instance, in the planar Pentium 4, the L1D cache was placed beside the functional units. The worst-case operand latency occurs when the operand travels from the far end of the data cache to the farthest functional unit. In the stacked Pentium 4 implementation, the functional units are placed right under the center of the data cache. This reduces wire length and latency, enabling one pipeline stage to be eliminated in a performance-critical part of the design. Another pipeline stage was removed in the floating point cluster. In the planar implementation, a register file has to drive its operands not only to the multimedia (SIMD) unit, but across it and into the input of the floating point unit as well. As a result of this arrangement, the floating point unit requires two more cycles than necessary to access its operands. In the 3D redesign, the multimedia unit is left beside the register file on the bottom layer, while the floating point unit is placed directly over the register file on the top layer, removing the two clock cycles of wire delay and reducing the access latency. As a result, both units have optimal access to the register file without penalizing either use case.
Figure 11 – 3D Pentium 4 with data cache and FP unit on top; SIMD unit and register file on bottom (Source: Intel)
Besides stacking whole functional units on top of one another, another technique Intel demonstrated is splitting the larger units into smaller pieces, with each slice occupying a layer of the stacked implementation. The benefit of this approach is reduced intra-block latency and power consumption, which complements the shorter inter-block wires and their power savings. The Pentium 4’s large 1MB L2 cache was split into two sets of smaller sub-arrays, which reduced cache line read latency by 25% and cache power dissipation by 20%. The hottest block in the planar Pentium 4 is the dynamic instruction scheduler, which chooses ready instructions to execute in any given cycle. The instruction scheduler was split in a manner that greatly shortened critical internal wires, resulting in a 15% shorter access latency. Since the instruction scheduler is a critical part of the design, intricately timed to occupy only a small fraction of the stages of the overall pipeline, the reduction in latency was not used to eliminate pipe stages but to enable a less aggressive circuit implementation. The scheduler circuitry was converted from dynamic to static logic, eliminating half the power dissipation of the block, with similar power density and block-level performance.
Chart 1 – Temperature and Power of 2D and 3D Pentium 4
The redesigned 3D layout of the Pentium 4 eliminated approximately a quarter of the pipeline stages from the final design, as shown in Table 1. These pipeline stages were extraneous, as they were used purely to drive signals across wires from one unit to another in the planar implementation. This pipeline compaction improved single-threaded performance by roughly 15%, leading to a much more efficient design. Since most of the pipeline stages removed were dominated by global interconnect, the new 3D design halved the number of repeaters. In conjunction with more efficient intra-block interconnect, these two improvements achieved a 15% decrease in total power consumption. Substantially increasing performance while decreasing power is an unusual feat in modern microprocessor design. Most improvements, such as out-of-order execution or multithreading, increase performance but add to power consumption. This simple redesign of the Pentium 4 leads to a 35% improvement in efficiency, as measured by performance per watt. That is the beauty of three-dimensional integration: in many cases the designer can have the best of both worlds.
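The 35% efficiency figure follows directly from the two 15% figures quoted above. As a brief worked check, with the symbols $\mathrm{perf}$ and $P$ introduced here only for illustration (they are not notation from the original study), the ratio of performance per watt between the 3D and planar designs is:

```latex
\frac{\mathrm{perf}_{3D} / P_{3D}}{\mathrm{perf}_{2D} / P_{2D}}
  = \frac{\mathrm{perf}_{3D} / \mathrm{perf}_{2D}}{P_{3D} / P_{2D}}
  = \frac{1.15}{0.85}
  \approx 1.35
```

Because both inputs are rounded ("roughly 15%"), the result should be read as approximately 35% rather than as an exact value.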
Table 1 – Selected pipeline stage reductions and related performance improvements
The power consumption of the planar Pentium 4 is 147W, while the stacked implementation consumes 125W. However, the peak power density of the 3D design did increase. The hottest part of the processor rose from 99 degrees centigrade to 113 degrees, as displayed in Chart 1. This double-digit thermal increase at the hotspot is problematic, since it will likely impact the operating reliability and long-term durability of the integrated circuit.
In modern microprocessors, power dissipation is dominated by dynamic power, which is the power expended in charging and discharging the parasitic capacitance of transistors and interconnect. Dynamic power is proportional to the product of the switching capacitance, the frequency, and the square of the supply voltage. A small decrease in frequency therefore allows the supply voltage to be reduced as well, which results in a disproportionately large decrease in power dissipation.
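The proportionality described above can be sketched in a few lines of Python. This is a minimal illustration, not a model of the Pentium 4: the function name `dynamic_power_ratio` is hypothetical, and the loop assumes that supply voltage can be lowered by the same factor as frequency, which is a simplification that real silicon follows only over a limited range.

```python
def dynamic_power_ratio(freq_scale: float, volt_scale: float,
                        cap_scale: float = 1.0) -> float:
    """Relative dynamic power after scaling, using P proportional to C * V^2 * f.

    Each argument is a scale factor relative to the original design
    (1.0 means unchanged). The constant of proportionality cancels out
    because only the ratio of new power to old power is computed.
    """
    return cap_scale * volt_scale ** 2 * freq_scale


# Assumption for illustration: voltage is scaled by the same factor as frequency.
for scale in (1.00, 0.95, 0.90, 0.85):
    ratio = dynamic_power_ratio(freq_scale=scale, volt_scale=scale)
    print(f"frequency and voltage x{scale:.2f} -> dynamic power x{ratio:.3f}")
```

Under that assumption power falls with the cube of the scale factor, so a 10% reduction in frequency and voltage removes roughly 27% of the dynamic power, whereas reducing frequency alone would remove only 10%.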
In order to reach neutral thermals, the frequency of the 3D Pentium 4 processor was reduced, with a corresponding drop in its supply voltage. The net result was a 97W thermal envelope, a total power reduction of one third compared to the original 147W design. More importantly, the peak hotspot temperature of the stacked implementation was brought down to approximately 99 degrees centigrade. Even with this reduction in frequency, the stacked implementation ran a modest 8% faster than the original planar design. If low power dissipation were the ultimate goal, frequency could be scaled down further. The designers estimated that they could achieve the same performance as the planar design while cutting total power consumption by more than half.
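The design points quoted in this section can be compared side by side with a short Python sketch. The performance and power values below are the rounded figures from the text; the class and function names (`DesignPoint`, `power_saving`, `relative_efficiency`) are hypothetical helpers introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignPoint:
    name: str
    relative_performance: float  # planar design = 1.00
    power_watts: float


# Rounded figures quoted in the text.
PLANAR = DesignPoint("planar", 1.00, 147.0)
STACKED = DesignPoint("3D, original frequency", 1.15, 125.0)
STACKED_NEUTRAL = DesignPoint("3D, thermally neutral", 1.08, 97.0)


def power_saving(point: DesignPoint, baseline: DesignPoint = PLANAR) -> float:
    """Fraction of the baseline's power that the design point saves."""
    return 1.0 - point.power_watts / baseline.power_watts


def relative_efficiency(point: DesignPoint, baseline: DesignPoint = PLANAR) -> float:
    """Performance per watt of the design point, relative to the baseline."""
    point_ppw = point.relative_performance / point.power_watts
    baseline_ppw = baseline.relative_performance / baseline.power_watts
    return point_ppw / baseline_ppw


for p in (PLANAR, STACKED, STACKED_NEUTRAL):
    print(f"{p.name:24s} power saving {power_saving(p):6.1%}   "
          f"perf/W vs planar {relative_efficiency(p):.2f}x")
```

With these inputs, the thermally neutral design saves about 34% of the planar design's power, matching the "one third" in the text, and delivers roughly 1.6 times the performance per watt. Both results inherit the rounding of the quoted figures.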