Putting It All Together
It should be obvious by now that the trace cache and double-frequency ALUs in Willamette show that Intel has come up with a delightful bag of new tricks to teach the old x86 dog. So how fast is it going to be? Well, it is clear from the pipeline diagram of Willamette in part one of my article that it is designed for speed. It even has two complete pipeline stages (named “Drive”) which ostensibly reserve an entire processor clock cycle just to move signals from one part of the chip to another. A conservative guess is that Willamette will achieve at least 50% higher clock frequencies than the P6 core in the same manufacturing process.
It gets trickier trying to estimate how Willamette will perform compared to P6 on a clock-for-clock basis. One should realize that memory latency does not scale with clock rate, so it takes a great deal of renovation to a microarchitecture just to keep from losing clock-normalized performance when the clock rate increases. There are also a huge number of details about Willamette that Intel has not yet disclosed, including all the warts. We don’t know the instruction scheduling and dispatch rules. We don’t know how Willamette’s branch prediction hardware works or how effective it will be. The efficacy and robustness of the trace cache on a variety of new and existing applications have yet to be assessed.
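The point about non-scaling memory latency can be illustrated with a toy model. The sketch below assumes a simple additive CPI model, and every number in it (base CPI, miss rate, latency, clock rates) is hypothetical and chosen only for illustration. None of these values come from Intel or describe either processor.

```python
def cycles_per_instruction(base_cpi, misses_per_instr, mem_latency_ns, clock_ghz):
    """Toy model: CPI = core CPI + memory stall cycles per instruction.

    A fixed latency in nanoseconds costs more cycles as the clock rises,
    because cycles = nanoseconds * (cycles per nanosecond).
    """
    latency_cycles = mem_latency_ns * clock_ghz
    return base_cpi + misses_per_instr * latency_cycles

# Hypothetical inputs: same core and same memory, only the clock changes.
BASE_CPI = 1.0           # assumed CPI with a perfect memory system
MISSES_PER_INSTR = 0.01  # assumed misses to main memory per instruction
MEM_LATENCY_NS = 100.0   # assumed main memory latency, fixed in nanoseconds

cpi_slow = cycles_per_instruction(BASE_CPI, MISSES_PER_INSTR, MEM_LATENCY_NS, 1.0)
cpi_fast = cycles_per_instruction(BASE_CPI, MISSES_PER_INSTR, MEM_LATENCY_NS, 1.5)

# Clock-normalized performance (IPC) of the faster part relative to the slower one.
ipc_ratio = cpi_slow / cpi_fast
# Absolute speedup is the clock ratio times the IPC ratio.
absolute_speedup = 1.5 * ipc_ratio

print(f"CPI at 1.0 GHz: {cpi_slow:.2f}, CPI at 1.5 GHz: {cpi_fast:.2f}")
print(f"Relative IPC: {ipc_ratio:.2f}, absolute speedup: {absolute_speedup:.2f}x")
```

Under these made-up inputs, a 50% faster clock with no other changes loses about a fifth of its per-clock performance, which is why a design has to improve just to stand still on a clock-for-clock basis.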
But it is likely that Willamette can issue 6 uops per clock cycle, twice as many as P6. Willamette can issue up to four integer operations to its two double-frequency ALUs each processor clock cycle, again twice as many as P6 (although we don’t know what restrictions apply). The trace cache hides branch prediction and x86 instruction decoding from Willamette’s out-of-order execution engine, performs limited loop unrolling, and reduces the time needed to switch the uop stream to the correct address when a branch mispredict is detected and the alternate path code is resident in the cache. Willamette also has a much larger instruction reordering window than the P6 and can support more than a hundred instructions in flight at a given time, along with 48 outstanding loads and 24 outstanding stores.
I would go so far as to estimate that Willamette might achieve 20 to 30% higher performance than P6 on a clock-for-clock basis on most integer code. Combined with the 50% or more higher clock rate, that is equivalent to an absolute integer performance approaching twice that of the P6 in the same manufacturing process. That increase seems staggering, but it is much smaller than, for example, the more than threefold system bus bandwidth advantage Willamette enjoys over current P6 implementations. On code limited by memory performance, Willamette might dominate the P6 even more.
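The "approaching twice" figure follows from multiplying the two estimates above, since per-clock gains and clock-rate gains compound. A minimal sketch of that arithmetic, using only the percentages stated in the text (the function name is just for illustration):

```python
def combined_speedup(per_clock_gain, clock_gain):
    """Combine a clock-for-clock gain with a clock-rate gain.

    Both arguments are fractions (0.20 means 20% higher). The gains
    multiply because performance = work per clock * clocks per second.
    """
    return (1.0 + per_clock_gain) * (1.0 + clock_gain)

# Estimates from the text: 20 to 30% per clock, 50% higher clock rate.
low_estimate = combined_speedup(0.20, 0.50)
high_estimate = combined_speedup(0.30, 0.50)

print(f"Estimated integer performance vs. P6: "
      f"{low_estimate:.2f}x to {high_estimate:.2f}x")
```

This gives roughly 1.8x to 1.95x, and a clock rate advantage above 50% would push the upper end past 2x.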
Willamette is a serious threat to the success AMD has been enjoying with its K7 Athlon device in the high end of the x86 processor marketplace. It will likely support clock frequencies about 50% higher than P6 and 30% higher than K7 in a similar process (although AMD will soon benefit from a switchover to a copper interconnect based 0.18 um process). Everything else being equal, Willamette will likely enjoy at least as large a clock-normalized performance (IPC) advantage over K7 on general-purpose integer code as the K7 enjoys over the P6 core. Willamette will likely be larger than the K7 core in similar processes, although not dramatically so. The trace cache occupies a great deal more area than the 16 KB I-cache in the P6 and perhaps at least 50% more area than the 64 KB I-cache in the K7. However, this is balanced by the fact that Willamette likely devotes less area to x86 instruction parsing, alignment and decoding than the K7 or even the P6. Willamette runs at very high clock rates and will likely consume at least twice as much power as P6, or somewhere in the region of 60 Watts or more.