The Basics of Willamette
The organization of the Willamette processor is shown in Figure 3. Compared to the P6 design, the primary defining features of Willamette are a trace cache, dual super-pipelined ALUs operating at twice the processor frequency, deeper pipelining, improved branch prediction, and a much higher bandwidth system interface.
Figure 3 Organization of Willamette Core
The trace cache is an innovative first-level instruction cache (I-cache) that stores sequences of micro-ops (uops) organized in dynamic program execution order. The conventional I-cache in the P6, by contrast, stores x86 code in static program order, organized by memory location. The effect of this radical change is to decouple branch prediction and the translation of x86 instructions into uops from the repetitive (loop-based) execution of program code out of the trace cache. It also effectively performs loop unrolling (a code transformation sometimes performed by sophisticated compilers to speed up code) in hardware, but without the associated code size expansion.
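The difference between static and dynamic ordering can be illustrated with a toy model. The sketch below is not Willamette's actual mechanism: the eight-entry program, the one-uop-per-instruction simplification, the always-taken branches, and the names `PROGRAM`, `static_line`, and `build_trace` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical static program: address -> (uop name, taken-branch target or None).
# Assumption: each instruction decodes to exactly one uop and every branch is taken.
PROGRAM: Dict[int, Tuple[str, Optional[int]]] = {
    0: ("load", None),
    1: ("add", None),
    2: ("jmp", 5),     # taken branch that skips addresses 3 and 4
    3: ("sub", None),
    4: ("mul", None),
    5: ("store", None),
    6: ("dec", None),
    7: ("jnz", 0),     # loop back-edge, assumed taken
}


def static_line(start: int, length: int) -> List[str]:
    """Contents of a conventional cache line: consecutive addresses."""
    return [PROGRAM[a][0] for a in range(start, start + length) if a in PROGRAM]


def build_trace(start: int, max_uops: int) -> List[str]:
    """Contents of a trace: uops in the order they would execute."""
    trace: List[str] = []
    pc = start
    while len(trace) < max_uops and pc in PROGRAM:
        name, target = PROGRAM[pc]
        trace.append(name)
        # Follow the taken branch if there is one, otherwise fall through.
        pc = target if target is not None else pc + 1
    return trace


if __name__ == "__main__":
    print("static line:", static_line(0, 6))
    print("trace      :", build_trace(0, 12))
```

In this toy, the static line holds `sub` and `mul` even though they are never executed, while the 12-uop trace holds two complete iterations of the loop back to back. That is the sense in which a trace resembles an unrolled loop, although the program image in memory is unchanged.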
Figure 4 Basic Willamette Execution Pipeline
As shown in Figure 4, Willamette is a very deeply pipelined processor. It uses 20 pipe stages to execute integer instructions, including the 4 pipe stages associated with fetching uops from the trace cache. If you include the pipeline stages associated with fetching x86 code from the L2 cache, decoding it into uops, and loading uops and program mapping/flow information into the trace cache, the total number of pipeline stages probably approaches 30 or more.
The branch misprediction penalty appears to be at least 19 clock cycles when the correct path is present in the trace cache. If the trace cache misses, the penalty is considerably higher. This compares to a minimum branch mispredict penalty of 11 cycles for the P6 core. The P6 uses the two-level Yeh and Patt adaptive branch prediction scheme. Even though the P6 predicted branches correctly around 90% of the time, it still lost about 30% of its potential performance to branch mispredicts. Although Willamette will no doubt use more modern branch prediction techniques like gshare and dynamic prediction strategy selection, its huge mispredict penalty will make its performance very sensitive to the efficacy of its branch prediction algorithm(s) on the particular code being run.
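A back-of-the-envelope model shows how the penalty translates into lost performance. Only the 11-cycle and 19-cycle penalties and the 90% prediction accuracy come from the text above. The base CPI of 0.5 and the 20% branch frequency are assumed values, chosen so that the P6 case lands near the quoted 30% figure, and `lost_fraction` is a hypothetical helper, not a measured model of either core.

```python
def lost_fraction(base_cpi: float, branch_freq: float,
                  mispredict_rate: float, penalty_cycles: float) -> float:
    """Fraction of all cycles spent recovering from branch mispredicts.

    Simple additive model: every mispredicted branch costs a fixed number
    of stall cycles on top of an otherwise constant base CPI.
    """
    stall_cpi = branch_freq * mispredict_rate * penalty_cycles
    return stall_cpi / (base_cpi + stall_cpi)


# Assumed workload parameters (illustrative, not measurements).
BASE_CPI = 0.5          # cycles per instruction with perfect prediction
BRANCH_FREQ = 0.20      # one instruction in five is a branch
MISPREDICT_RATE = 0.10  # 90% prediction accuracy, as quoted for the P6

if __name__ == "__main__":
    for name, penalty in (("P6", 11), ("Willamette", 19)):
        lost = lost_fraction(BASE_CPI, BRANCH_FREQ, MISPREDICT_RATE, penalty)
        print(f"{name}: {lost:.1%} of cycles lost to mispredicts")
```

Under these assumptions the model gives roughly 31% for the 11-cycle penalty and roughly 43% for the 19-cycle penalty at the same prediction accuracy. Halving the mispredict rate to 5% with the 19-cycle penalty brings the loss back to about 28%, which is why the quality of the predictor matters so much for a pipeline this deep.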
There is no doubt that Willamette will achieve much higher clock rates than a P6 core in the same process. The open question is whether it can deliver a performance gain over the P6 that matches or exceeds its higher clock rate. Or perhaps Willamette is an expensive demonstration that microarchitectural innovation in implementations of the ancient x86 ISA has long since passed the point of diminishing returns.
In the second part of this article I will examine in detail the unique features of Willamette, such as the trace cache and the double-speed ALUs.