A Power Constrained High Performance x86 MPU
So what is the best way to design a high performance x86 microprocessor that limits both switching and leakage current power? The obvious conclusion from the qualitative arguments of the previous section is that neither a 20 stage hyper-pipelined speedracer like the Pentium 4 (P4) nor a brainiac capable of fetching and executing up to 6 instructions per clock cycle like the Merced/Itanium is well suited to power efficiency. Low power MPU designers instead try to hit the sweet spot in Figure 3 that maximizes computational energy efficiency. So what kind of 0.18 um x86 MPU represents that sweet spot, the near perfect balance between clock frequency and IPC?
Very likely it will be an out-of-order (OOO) execution design employing the traditional x86 instruction decomposition into micro-ops (uops) used since the P6 core was introduced nearly 6 years ago. It is true that OOO designs are more complex than in-order superscalar designs like the P5 Pentium, and complexity is normally something to avoid when minimizing power. But with the huge and growing disparity between processor clock rates and DRAM access and cycle times, the 30% or more performance penalty of turning back the clock to an in-order design makes such a choice extremely unattractive. Once this basic framework is accepted, the major design parameters are the maximum x86 instruction decode rate; the sustained and peak uop issue, execution, and retirement rates; and the number and nature of the integer, SIMD, and floating point functional units.
The PIII, K7, and P4 processor cores can all sustain the execution of up to three uops per clock cycle. Is this level of parallelism necessary, or is a two uop/cycle core acceptable, particularly if higher maximum clock rates can be achieved? Past investigations of the operational characteristics of the P6 core show that while on average only about 1.1 to 1.5 uops are executed per cycle, about 65% of uops in integer programs and 80% in FP programs are retired in groups of three. This strongly suggests that the ability to fetch, execute, and retire up to 3 uops per cycle is important to achieving an acceptable level of x86 MPU performance.
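The importance of that burst behavior can be seen with a toy retirement-bandwidth model. The group-size mix below is an illustrative assumption loosely matching the percentages quoted above, not measured P6 data, and the one-group-per-cycle retirement rule is a simplification:

```python
# Toy model: even though the P6 core averages only ~1.1-1.5 uops/cycle,
# most uops retire in bursts of three, so a narrower retire port hurts.
import math

def retire_cycles(groups, width):
    """Cycles to retire a stream of uop groups when at most `width`
    uops (all from a single group) can retire per cycle."""
    return sum(math.ceil(g / width) for g in groups)

# Assumed mix: ~65% of uops arrive in groups of three, the rest in
# ones and twos (50*3 + 20*2 + 30*1 = 220 uops total).
groups = [3] * 50 + [2] * 20 + [1] * 30
for width in (3, 2):
    print(width, sum(groups) / retire_cycles(groups, width))
```

On this assumed mix a 3-wide retire port sustains 2.2 uops/cycle during the bursts, while capping retirement at 2 uops/cycle stretches every group of three across two cycles and gives up about a third of that burst bandwidth.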
The next question is how to generate and feed those uops to the execution core. Both the P6 and K7 can decode up to three x86 instructions per clock cycle. This level of parallelism is costly both in logic complexity and the effect on the maximum clock frequency of the processor. Any effort to simplify this complex and performance critical logic circuitry will help reduce both the switching and leakage current power consumption of the processor as well as potentially offering higher clock rates. The trick for doing this is found in the Pentium 4 trace cache.
The P4 employs a trace cache rather than a traditional instruction cache. This means that most of the time the execution core is fetching and executing uops directly from the trace cache without the involvement of the x86 instruction decoders. This is ideal for a low power processor. It takes a certain amount of energy to identify an x86 instruction and all its optional components and decompose it into one or more uops. In a conventional processor like the PIII or K7, every uop cracked from an x86 instruction is used only once and then discarded, even for code repeatedly executed within a loop. When loop code is executed from a trace cache, the processor avoids repeatedly decoding the same x86 instructions by storing and re-using the uops generated the first time through the loop body.
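The decode-energy saving can be sketched in a few lines. This is a minimal caricature of the idea, not the real P4 design: the instruction names, the one-decode-per-instruction cost, and the PC-indexed cache are all illustrative assumptions:

```python
# Sketch of why a trace cache saves decode energy: uops for a loop body
# are decoded once, cached, and replayed on every later iteration.

def decode_count(body, iterations, use_trace_cache):
    """Return how many times the x86 decoder is invoked."""
    trace_cache = {}
    decodes = 0
    for _ in range(iterations):
        for pc, insn in enumerate(body):
            if use_trace_cache and pc in trace_cache:
                continue  # uops come straight from the trace cache
            decodes += 1  # invoke the decoder: this costs energy
            if use_trace_cache:
                trace_cache[pc] = "uops(" + insn + ")"
    return decodes

body = ["add", "cmp", "jne"]  # a hypothetical 3-instruction loop
print(decode_count(body, 1000, False))  # 3000 decodes, one per execution
print(decode_count(body, 1000, True))   # 3 decodes, one per instruction
```

For a thousand-iteration loop the conventional front end pays the decode cost three thousand times; the trace cache pays it three times, and every subsequent fetch bypasses the decoder entirely.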
A second effect of using a trace cache is that it is possible to achieve high performance with only a single x86 instruction decoder, since the decoder is only invoked to satisfy trace cache misses. The extra clock cycles needed by a single decoder, as opposed to the traditional three-way parallel decoder, overlap with the fetch of x86 instructions from the second level cache or main memory. Because parallel decoders require extensive dependency checking logic, using a single scalar decoder with a trace cache instead of a three-way parallel decoder can simplify the x86 front end logic by much more than two-thirds.
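The overlap argument reduces to a back-of-envelope latency model. All the cycle counts here are assumed round numbers for illustration, not measured P4 figures, and the model optimistically lets decode fully overlap the fetch:

```python
# Why a single scalar decoder suffices behind a trace cache: it only
# runs on trace cache misses, and its extra cycles hide under the L2
# fetch it is already waiting for.
import math

L2_FETCH_CYCLES = 10  # assumed latency to fetch a line from the L2 cache

def miss_latency(n_insns, decoder_width):
    """Decode overlaps the fetch, so the miss costs roughly the longer
    of the two, not their sum."""
    decode_cycles = math.ceil(n_insns / decoder_width)
    return max(L2_FETCH_CYCLES, decode_cycles)

print(miss_latency(6, 1))  # single scalar decoder
print(miss_latency(6, 3))  # 3-wide parallel decoder: no faster on a miss
```

Under these assumptions both front ends see the same 10-cycle miss latency: the L2 fetch dominates, so the three-way parallel decoder's extra width buys nothing on the path where the decoder actually runs.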