3D Processor Design
Intel conducted two system-level experiments that split an existing microprocessor into two stacked die: the cache memory and the processor core itself. One experiment was a simulation-driven study, while the other produced an actual working stacked-silicon device; we will focus on the motives, results, and implications of the manufactured device.
Current planar microprocessors are exceedingly complex designs, composed of many functional blocks interconnected through global wires. The more complex the design, the larger the surface area of the processor core, which increases the average global interconnect length and hence the signal propagation time. Modern processors have devoted whole pipeline stages solely to driving signals across the chip. Unfortunately, longer pipelines reduce overall performance by increasing the number of in-flight instructions, and hence the number of instructions discarded on a branch misprediction, which raises the misprediction penalty.
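The cost of a deeper pipeline can be made concrete with a first-order CPI model. This is a minimal sketch with assumed branch frequency and misprediction rate; none of the numbers come from Intel's experiment.

```python
# Illustrative CPI model showing how pipeline depth amplifies the
# branch-misprediction penalty. All numbers are assumed for this sketch.

def effective_cpi(base_cpi, branch_freq, mispredict_rate, flush_stages):
    """Average cycles per instruction once misprediction flushes are counted.

    flush_stages approximates the penalty in cycles: a deeper pipeline
    holds more in-flight instructions, all discarded on a mispredict.
    """
    return base_cpi + branch_freq * mispredict_rate * flush_stages

shallow = effective_cpi(1.0, branch_freq=0.20, mispredict_rate=0.05, flush_stages=14)
deep    = effective_cpi(1.0, branch_freq=0.20, mispredict_rate=0.05, flush_stages=30)
print(f"14-stage flush: {shallow:.2f} CPI")  # 1.14 CPI
print(f"30-stage flush: {deep:.2f} CPI")     # 1.30 CPI
```

Even with an aggressive predictor (95% accuracy here), more than doubling the flush depth visibly inflates the average CPI, which is why removing pipeline stages pays off.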
Figure 10 – Planar Pentium 4 layout illustrating signals driving operands between and across functional unit blocks (Source: Intel)
The Intel Pentium 4 processor used in the experiment is a very high-frequency design with a thirty-stage misprediction pipeline, shown in Figure 8. The processor was broken into two smaller die (Figure 10) – each half the size of the original – stacked on top of each other in a face-to-face arrangement. Because the Pentium 4 processor core is primarily composed of logic blocks, a minimal three-dimensional arrangement of two stacked die was sufficient to capture most of the benefits of 3D integration without compromising the resulting design through power density issues. Logic elements switch, and therefore consume power, at a much higher rate than memory circuitry. This presented a problem, since the design's existing thermal issues could be exacerbated by hot spots. To remain within the thermal limit of the original design, very active power regions could not be placed on top of each other, as this would increase power density and temperature. Blocks had to be carefully arranged to complement each other power-wise.
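The placement constraint follows from simple arithmetic: two vertically overlapping blocks must dissipate their combined power through the same footprint. The wattages and area below are assumed values chosen only to illustrate the effect, not figures from the Intel study.

```python
# Sketch of the power-density constraint on stacked block placement.
# Powers and footprint area are assumed for illustration only.

def stacked_density(top_power_w, bottom_power_w, footprint_cm2):
    """Power density (W/cm^2) of two vertically overlapping blocks:
    their combined heat must exit through one shared footprint."""
    return (top_power_w + bottom_power_w) / footprint_cm2

# Two hot logic blocks placed directly on top of each other...
hot_on_hot = stacked_density(8.0, 8.0, footprint_cm2=0.1)    # 160 W/cm^2
# ...versus a hot block paired with a cool, memory-like block.
hot_on_cool = stacked_density(8.0, 1.0, footprint_cm2=0.1)   #  90 W/cm^2
```

Pairing hot blocks with cool ones keeps the worst-case density far closer to the planar design's limit, which is exactly the complementary arrangement described above.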
Intel had one chief goal for 3D integration: reducing the length of metal interconnects. Face-to-face stacking was therefore chosen, as it minimizes the inter-die interconnect distance. It reduces the length and latency of the inter-die vias – as well as their width – since they do not have to tunnel through the silicon substrate of each die, as in a face-to-back arrangement. This denser arrangement places more transistors within a clock cycle of each other, reducing global metal interconnect latency as a proportion of cycle time and improving overall power consumption. Reducing the metal wiring between functional unit blocks yields a processor design limited more by transistor switching than by interconnect delay. In this particular instance, Intel achieved both higher-frequency operation and fewer pipeline stages; the shorter pipeline decreases the branch misprediction penalty and therefore increases efficiency and performance. Power dissipation is also reduced as a result of the lower wire capacitance and fewer repeaters associated with global metal interconnects. Additionally, the latches and flip-flops for the removed pipeline stages can be eliminated.
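Why shorter wires help so much can be seen from the standard distributed-RC (Elmore) delay model: the delay of an unrepeated wire grows with the square of its length, so a floorplan that halves wire spans cuts raw wire delay by roughly 4x before repeaters are even considered. The per-millimeter resistance and capacitance values below are assumed, representative numbers, not data from the Intel design.

```python
# First-order model of global wire delay. The distributed-RC delay of an
# unrepeated wire is ~0.5 * R_total * C_total, which scales as length^2.
# r_ohm_per_mm and c_ff_per_mm are assumed, representative values.

def wire_delay_ps(length_mm, r_ohm_per_mm=200.0, c_ff_per_mm=200.0):
    """Elmore delay (picoseconds) of an unrepeated distributed-RC wire."""
    r_total = r_ohm_per_mm * length_mm            # ohms
    c_total = c_ff_per_mm * length_mm * 1e-15     # farads
    return 0.5 * r_total * c_total * 1e12         # seconds -> ps

planar  = wire_delay_ps(10.0)  # long cross-chip route on a planar die
stacked = wire_delay_ps(5.0)   # same route after folding the die in two
print(f"planar:  {planar:.0f} ps")   # 2000 ps
print(f"stacked: {stacked:.0f} ps")  #  500 ps
```

The quadratic scaling is also why designers insert repeaters on long wires; shortening the wires instead lets some of those repeaters (and their power) be removed outright, consistent with the savings described above.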