The Narrowing Road Ahead
What would an optimal performance MPU look like in a 130nm process? Indications are that Intel used various design methods, such as repeater insertion and dedicated pipeline stages for global signal propagation, to keep global wire delay to a little under 20% of critical path timing for the 3.4GHz top speed grade of the Northwood P4. This x86 CPU core uses about 28m transistors and occupies about 100mm2. Achieving the optimal performance CPU criterion (global wire delay equal to the latch overhead plus transistor delay portion of critical path timing) would require a CPU design about 5x larger and more complex than the Northwood. It would clock at 2GHz and consume about 180 Watts of dynamic power. With IPC approximately 2.2 times higher than Northwood, its performance would be about 1.3 times greater with the same 512KB L2 cache (2.2 x 2.0/3.4 ≈ 1.3). Obviously such a design proposal pushes die area and device power far beyond any reasonable high volume desktop MPU design. Nevertheless, applying the process scaling regime for optimal performance CPU design derived in the previous section to this hypothetical optimal performance 130nm x86 CPU reveals a remarkable trend, as shown in Table 2. Keep in mind that integrated L2+ cache size will scale rapidly with shrinking feature size, which would accelerate performance faster than shown.
| Process | 130nm | 90nm | 65nm | 45nm |
|---|---|---|---|---|
| CPU Size (mm2) | 500 | 170 | 58 | 20 |
| Clock Frequency (GHz) | 2.0 | 2.9 | 4.1 | 5.8 |
| IPC (relative to 3.4 GHz P4) | 2.2 | 1.9 | 1.6 | 1.3 |
| Performance (relative to 3.4 GHz P4) | 1.3 | 1.6 | 1.9 | 2.2 |
| CPU Dynamic Power (W) | 180 | 88 | 43 | 21 |
| CPU Power Density (W/cm2) | 36 | 52 | 74 | 105 |
Table 2 – Optimal Performance x86 CPU scaling with Process
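The derived rows of Table 2 follow from the primary ones by simple arithmetic: relative performance is relative IPC multiplied by clock frequency normalized to the 3.4 GHz Northwood baseline, and power density is dynamic power divided by die area. A minimal Python sketch that reproduces them (list values are taken directly from the table; the variable names are my own):

```python
# Reproduce the derived rows of Table 2 from the primary rows.
# Data values are from the table; the baseline is the 3.4 GHz Northwood P4.

BASE_CLOCK_GHZ = 3.4  # Northwood P4 reference clock

nodes     = ["130nm", "90nm", "65nm", "45nm"]
area_mm2  = [500, 170, 58, 20]      # CPU size per node
clock_ghz = [2.0, 2.9, 4.1, 5.8]    # clock frequency per node
ipc_rel   = [2.2, 1.9, 1.6, 1.3]    # IPC relative to Northwood
power_w   = [180, 88, 43, 21]       # CPU dynamic power per node

for i, node in enumerate(nodes):
    # Relative performance = relative IPC x (clock / baseline clock)
    perf = ipc_rel[i] * clock_ghz[i] / BASE_CLOCK_GHZ
    # Power density in W/cm2 (1 cm2 = 100 mm2)
    density = power_w[i] / (area_mm2[i] / 100.0)
    print(f"{node}: perf {perf:.1f}x P4, {density:.0f} W/cm2")
```

Running this recovers the Performance and Power Density rows of the table, e.g. 1.3x and 36 W/cm2 at 130nm rising to 2.2x and 105 W/cm2 at the smallest node.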
It seems likely that the current trajectory of x86 MPU design will crash into the global wire delay scaling barrier around the 65nm process node. Beyond that, one might expect future design advancement to approximately follow the optimal performance CPU scaling trends I derived earlier, listed in Table 1, and applied in Table 2. It is clear that CPU cores will have to shrink very rapidly in size beyond 65nm to keep clock frequency and performance moving ahead at the best possible pace. This frees up die space for larger L2+ caches, more highly integrated system level functionality, and the inclusion of multiple CPU core instances on the same device, i.e. chip level multiprocessing or CMP. This latter trend will be constrained mainly by power budgeting rather than die size considerations.
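The claim that CMP is power-limited rather than area-limited can be illustrated with the smallest node in Table 2 (20mm2 and 21W per core). The die and power budgets below are illustrative assumptions of my own, not figures from the article:

```python
# Sketch: CMP core count is bounded by power budget before die area,
# using the 20 mm2 / 21 W per-core figures from Table 2.
CORE_AREA_MM2 = 20
CORE_POWER_W  = 21
DIE_BUDGET_MM2 = 140   # assumed die area available for CPU cores
POWER_BUDGET_W = 100   # assumed desktop power envelope

cores_by_area  = DIE_BUDGET_MM2 // CORE_AREA_MM2   # 7 cores fit the die
cores_by_power = POWER_BUDGET_W // CORE_POWER_W    # only 4 fit the power budget
print(min(cores_by_area, cores_by_power))          # power is the binding constraint
```

Under these assumptions the die could physically hold seven cores, but the power envelope caps the design at four, which is the sense in which power budgeting, not die size, governs CMP.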