What is an Adequate Size for a Trace Cache?
The uops in the original P6 core are 118 bits long. The uops stored in the Willamette are unlikely to be any smaller than P6 uops, since the instruction set has grown and a few extra bits are apparently needed to help correlate uops with the originating x86 macro instructions. Let’s assume the uops stored within the data array of the Willamette trace cache are 120 bits long. The Intel trace cache patent suggests an implementation of 256 sets by 4 ways by 6 uops per way. This totals 256 * 4 * 6 * 120 = 737,280 bits, or 90 Kbytes of SRAM. This is quite large compared to the 16 Kbyte capacity of the I-cache found in recent P6 implementations. Also, the tag and control data for the trace cache are almost certain to be larger than the tags found in a conventional cache design. The Willamette is reported to have 34 million transistors, or about 6 million more than the Coppermine Pentium III. No doubt a good chunk of the extra transistors are swallowed up by the innovative but expansive trace cache.
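The capacity arithmetic above can be checked with a short sketch. The 120-bit uop width is the assumption made in the text, the 256 x 4 x 6 organization is the one suggested by the patent, and the function and constant names are purely illustrative. Tag and control storage is not counted.

```python
# Assumed figures from the text: 120-bit uops, 256 sets x 4 ways x 6 uops per way.
# These are estimates, not confirmed Willamette parameters.
UOP_BITS = 120
SETS = 256
WAYS = 4
UOPS_PER_WAY = 6
P6_ICACHE_KBYTES = 16  # I-cache size of recent P6 implementations, per the text


def trace_cache_data_bytes(sets, ways, uops_per_way, uop_bits):
    """Return the size of the trace cache data array in bytes (tags excluded)."""
    total_bits = sets * ways * uops_per_way * uop_bits
    return total_bits // 8


data_bytes = trace_cache_data_bytes(SETS, WAYS, UOPS_PER_WAY, UOP_BITS)
data_kbytes = data_bytes / 1024
uop_slots = SETS * WAYS * UOPS_PER_WAY

print(f"uop slots:       {uop_slots}")
print(f"data array size: {data_bytes} bytes ({data_kbytes:.0f} Kbytes)")
print(f"ratio to P6 I-cache: {data_kbytes / P6_ICACHE_KBYTES:.2f}x")
```

Under these assumptions the data array holds 6,144 uop slots, and its raw size is a little over five and a half times that of the P6 I-cache.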
How does one judge the adequacy of our hypothetical 90 Kbyte trace cache? Should it be larger? Or is it wastefully large? One simple way of assessing this issue is to look at the relationship between x86 instructions and uops. The average length of IA-32 x86 instructions has been reported as 3.2 to 3.7 bytes. Let’s assume it is 3.5 bytes (28 bits). In the P6 design an x86 instruction generated, on average, about 1.5 uops. Assuming a rough equivalency between P6 and Willamette uops, an x86 instruction expands to about 1.5 * 120 = 180 bits of uops on average. This is an expansion ratio of about six to one. The Willamette does keep complex x86 instructions from polluting the trace cache, but these instructions are probably so rare as not to significantly affect the expansion ratio. So our hypothetical 90 Kbyte trace cache data array holds, at most, the uop equivalent of about 15 Kbytes of x86 code. I say “at most” because trace segment members are not necessarily always fully packed with 6 useful uops. I doubt it is by accident that this 15 Kbyte equivalency figure I’ve derived is nearly the same as the 16 Kbyte I-cache size of the P6. No doubt the Willamette’s designers wanted to limit the size of the trace cache so as not to impact the clock cycle time, but without making it less effective in exploiting program spatial locality than the conventional I-cache in the P6-based processors it will replace.
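The expansion-ratio estimate can be sketched the same way. All inputs below are the assumptions used in the text (120-bit uops, 1.5 uops per x86 instruction, a 90 Kbyte data array), and the helper name is illustrative. Note that the unrounded ratio is closer to 6.4 than to six, which puts the equivalent x86 capacity at about 14 Kbytes rather than 15; either way the figure lands just under the 16 Kbyte P6 I-cache.

```python
# Assumed figures from the text; none are confirmed Willamette parameters.
UOP_BITS = 120           # assumed width of a stored uop
UOPS_PER_X86 = 1.5       # average uops per x86 instruction in the P6
TRACE_CACHE_KBYTES = 90  # hypothetical data array size derived earlier


def expansion_ratio(avg_x86_bytes):
    """Bits of uops generated per bit of x86 code, on average."""
    x86_bits = avg_x86_bytes * 8
    uop_bits_per_insn = UOPS_PER_X86 * UOP_BITS
    return uop_bits_per_insn / x86_bits


def equivalent_x86_kbytes(avg_x86_bytes):
    """Upper bound on x86 code (in Kbytes) whose uops fit in the data array."""
    return TRACE_CACHE_KBYTES / expansion_ratio(avg_x86_bytes)


ratio = expansion_ratio(3.5)
equiv = equivalent_x86_kbytes(3.5)
print(f"expansion ratio at 3.5 bytes/insn: {ratio:.2f}")
print(f"equivalent x86 code: {equiv:.1f} Kbytes")

# Sensitivity to the reported 3.2 to 3.7 byte range of average instruction length.
for avg_bytes in (3.2, 3.5, 3.7):
    print(f"{avg_bytes} bytes/insn -> {equivalent_x86_kbytes(avg_bytes):.1f} Kbytes")
```

Across the reported 3.2 to 3.7 byte range, the equivalent capacity varies from roughly 13 to 15 Kbytes, and these are upper bounds, since they assume every trace segment is fully packed with 6 useful uops.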