High Productivity Transistors
It is important to control the overall transistor count (along with aggregate transistor width and drain and source area) in a low power processor to minimize the power lost to leakage current. Execution resources must be kept lean, with the aim of capturing the bulk of the performance potential in average applications while resisting the temptation to throw more transistors at capturing a bit more performance in a minority of programs. A desktop MPU designer might push to the point of diminishing performance returns by including, for example, a third integer execution unit. But that third integer unit sits idle for most clock cycles in the majority of programs. Gating the clock to the idle functional unit helps reduce wasted switching power but does nothing to reduce the leakage power associated with its thousands of transistors. It is conceivable that the actual power supply to idle functional units could be cut off with large power gating transistors to reduce leakage. But the long power down and power up latency, combined with the significant switching energy needed to turn the power gating transistors on and off, reduces the attractiveness of such a scheme.
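The power gating tradeoff described above can be made concrete with a rough break-even calculation: gating an idle unit only pays off if the unit stays off long enough for the leakage energy saved to exceed the energy spent switching the gating transistors. The sketch below is illustrative only. The function names and every numeric value are hypothetical assumptions, not measurements of any real processor, and the model ignores power up latency entirely.

```python
def gating_break_even_cycles(leak_power_w, gate_switch_energy_j, clock_hz):
    """Idle clock cycles needed before power gating saves net energy.

    All inputs are hypothetical: leakage power of the idle unit (watts),
    energy to switch the gating transistors off and back on (joules),
    and the core clock frequency (hertz).
    """
    break_even_seconds = gate_switch_energy_j / leak_power_w
    return break_even_seconds * clock_hz


def net_energy_saved_j(idle_cycles, leak_power_w, gate_switch_energy_j, clock_hz):
    """Leakage energy avoided during an idle period, minus the gating overhead."""
    idle_seconds = idle_cycles / clock_hz
    return leak_power_w * idle_seconds - gate_switch_energy_j


# Assumed values: 50 mW of leakage, 100 nJ per gating event, 1 GHz clock.
print(gating_break_even_cycles(0.05, 1e-7, 1e9))
print(net_energy_saved_j(100, 0.05, 1e-7, 1e9))
```

Under these assumed numbers the unit must stay idle for thousands of cycles before gating breaks even, so an idle period of only 100 cycles costs more energy than it saves. That is one way to read the reduced attractiveness of gating a unit that goes idle only briefly between instructions.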
A low power MPU designer must objectively weigh the performance benefit of extra execution resources against the cost in higher baseline power consumption. In this regard, the designer is more likely to favor the lean execution resources of the P6 core over the more generous but power hungry K7 core or the highly pipelined and speculative P4 core. One place where more transistors might be warranted to beef up a P6-style back end execution engine is in the fixed point and SSE execution units. This is particularly important for mobile computing applications where, for cost and power reasons, it may be desirable to have the host processor carry a good share of the signal processing load associated with high speed copper and wireless telecom interfaces.
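One simple way to frame that judgment is to ask whether an extra unit raises performance per watt. The sketch below assumes that metric purely for illustration. The text does not prescribe a particular figure of merit, and the function names and example numbers are hypothetical.

```python
def perf_per_watt(perf, power_w):
    """Performance (arbitrary units) delivered per watt consumed."""
    return perf / power_w


def extra_unit_worthwhile(base_perf, base_power_w, perf_gain_frac, added_power_w):
    """True if adding the unit improves performance per watt.

    perf_gain_frac is the assumed fractional speedup on average applications,
    and added_power_w is the assumed extra leakage plus switching power.
    """
    new_perf = base_perf * (1.0 + perf_gain_frac)
    new_power_w = base_power_w + added_power_w
    return perf_per_watt(new_perf, new_power_w) > perf_per_watt(base_perf, base_power_w)


# Hypothetical third integer unit: 3% faster on average, 8% more power.
print(extra_unit_worthwhile(1.0, 1.0, 0.03, 0.08))
# Hypothetical SSE upgrade: 10% faster on the target workload, 5% more power.
print(extra_unit_worthwhile(1.0, 1.0, 0.10, 0.05))
```

A designer might reasonably weight the result differently, for example by penalizing baseline power more heavily in a battery powered design, so this comparison is a starting point rather than a decision rule.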
For good performance it is necessary to maximize the fraction of time that the limited number of functional units are active, and effective instruction and data caches are vital to achieving this. The data cache would likely be conventional, 32 or 64 KB in size, with a three-cycle load-to-use latency. It is difficult to achieve the two-cycle latency of the Pentium 4 data cache without employing the P4’s power hungry data speculation techniques, using a very small cache, or severely limiting processor clock frequency. A trace cache at least comparable to the Pentium 4’s in size and associativity would be desirable to keep the x86 decoder idle as much as possible.
But transistor economy doesn’t mean all microarchitectural features must be cut to the bone. Some features actually reduce power by reducing unproductive or power hungry operations. An example of an unproductive operation is the speculative execution of instructions past an incorrectly predicted branch. In this case all the energy expended executing those instructions is wasted, because the results must be discarded before the processor resumes execution in the correct branch direction. A sophisticated branch predictor requires more transistors than a simpler design, but it can easily increase computational energy efficiency as well as performance by reducing the frequency of branch mispredictions and the associated speculative execution down false paths.
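The energy argument for a better branch predictor can be sketched with a simple model of wasted work: the fraction of all executed instructions that lie on a mispredicted path and are later discarded. The model, its parameter names, and the sample rates are illustrative assumptions, and it treats every instruction as costing the same energy.

```python
def wasted_fraction(branch_freq, mispredict_rate, wrong_path_instrs):
    """Fraction of executed instructions that are discarded wrong-path work.

    branch_freq:       branches per useful instruction (assumed)
    mispredict_rate:   fraction of branches predicted incorrectly (assumed)
    wrong_path_instrs: instructions executed down the false path per
                       misprediction before recovery (assumed)
    """
    wasted_per_useful = branch_freq * mispredict_rate * wrong_path_instrs
    return wasted_per_useful / (1.0 + wasted_per_useful)


# Hypothetical simple predictor: 10% of branches mispredicted.
simple = wasted_fraction(0.2, 0.10, 20)
# Hypothetical sophisticated predictor: 4% of branches mispredicted.
better = wasted_fraction(0.2, 0.04, 20)
print(round(simple, 3), round(better, 3))
```

If execution energy scales roughly with instructions executed, the difference between the two fractions approximates the energy the better predictor recovers. That saving has to be set against the leakage and switching power of the predictor's own additional transistors.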