So How Can Willy Do It?
There is not much wasted time in the K7’s data cache access scheme. Although the Willamette’s 2 KB way memory arrays can be read more quickly than the K7’s 32 KB way memory arrays in similar process technologies, the difference is far too small for Willamette to reach 2 cycle accesses using conventional techniques without incurring a huge clock frequency penalty. I believe the Willamette employs a recent innovation in cache design called sum-addressed memory access. Sum-addressed memory first became widely known through its use in the 64 KB, 4-way, 2 cycle data cache of the Sun UltraSPARC-III processor.
In order to understand sum-addressed memory we must review the K7 pipeline again. During stage 8 the effective address is generated in the AGU. Creating an effective address typically involves one or more additions. A conventional 32 bit addition is a time consuming operation which can take up the majority of a clock period in a modern processor (Willamette’s infamous double frequency ALUs likely use a design approach I speculated about in my earlier article; that approach is not applicable to the AGU). The sum-addressed memory scheme basically allows the memory access to begin from the inputs to the AGU rather than waiting for the result of the AGU addition. The difference between a conventional cache and a sum-addressed cache is shown in Figure 3.
Figure 3. Sum-addressed Memory Operation
Instead of calculating the effective address and sending it to conventional word line decoders in the cache memory array, the sum-addressing scheme sends two values to the word line decoders. At first glance this seems utterly absurd. Don’t we end up with a whole bunch of adders in the memory and still have the same time lag? The answer is no, and the reason is due to a peculiarity in binary arithmetic. For arbitrary values A and B, the digital logic required to determine whether A+B = K for a constant value K is much faster and simpler than the logic needed to add A and B together, because the test can be made at each bit position independently, with no carry chain rippling across the word. That means sum-addressed row decoders, each testing A+B against its own fixed row index K, can locate the cache line indexed by A+B more quickly than A and B can be added together (an addition that must still be performed anyway in order to generate the effective address input to the TLB).
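The carry-free equality test at the heart of the scheme can be sketched in a few lines of Python. This is a toy model of the logic, not the actual decoder circuit; the function name and 32 bit width are my own choices.

```python
MASK = 0xFFFFFFFF  # model 32-bit modular arithmetic

def sums_to(a: int, b: int, k: int) -> bool:
    """Carry-free test of whether (a + b) mod 2**32 == k.

    x holds, per bit, the carry that *must* enter that bit position
    if the sum is to equal k; y holds the carry that each bit *would*
    pass upward under that same assumption. The sum equals k exactly
    when the two agree everywhere -- each bit is checked locally,
    with no carry propagating across the word.
    """
    x = (a ^ b ^ k) & MASK                        # required carry-in per bit
    y = (((a & b) | ((a | b) & ~k)) << 1) & MASK  # generated carry per bit
    return x == y
```

For example, `sums_to(5, 7, 12)` is true while `sums_to(5, 7, 13)` is false, and the test correctly handles modular wraparound, all without ever computing the sum.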
That is all very good, but Willy’s data cache is 4-way set associative. That allows it to avoid the aliasing problem, but now we have four blazingly fast 2 KB sum-addressed data arrays implementing the four ways of the data cache. Every access yields four data values, and we don’t know which one is right until much later, when the tags are compared. The answer is to do the same thing modern processors do with conditional branches: if you can’t be sure of something, predict the outcome and run with it, but be prepared to clean up the mess if your prediction is wrong. If your prediction accuracy is high enough, then you come out ahead in the game. So I believe the second new element in Willamette’s data cache is way prediction. The combination of these two concepts has been seen before in the data cache of an experimental IBM PowerPC MPU. In that chip, a 64 KB, 2-way, sum-addressed data cache uses a sum-indexed way predictor and achieves 2 cycle accesses at 1 GHz in a 0.22 um copper process. The hypothetical 2 clock cycle Willamette data cache design I have described is shown in Figure 4.
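The predict-and-recover idea can be illustrated with a minimal way predictor sketch. All of this is hypothetical: the table size, the indexing on the base register (chosen so no addition is needed before the lookup), and the class and method names are my own assumptions, not a description of the actual circuit.

```python
class WayPredictor:
    """Toy way predictor for a 4-way cache (hypothetical design).

    A small table, indexed by low-order bits of the base register so
    that no addition stands in the way of the lookup, remembers which
    way last satisfied an access with that index. The prediction is
    used immediately to steer the way MUX; the real tag compare
    verifies it later, and a mispredict triggers an update and replay,
    just as a mispredicted branch triggers a pipeline recovery.
    """

    def __init__(self, entries: int = 64, ways: int = 4):
        self.ways = ways
        self.table = [0] * entries  # predicted way per index

    def predict(self, base: int) -> int:
        return self.table[base % len(self.table)]

    def update(self, base: int, actual_way: int) -> None:
        # Called on a mispredict, after the tag compare reveals the truth.
        self.table[base % len(self.table)] = actual_way
```

As with branch prediction, the scheme pays off whenever the prediction accuracy is high enough that the occasional replay costs less than always waiting for the tag compare.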
Figure 4. Hypothetical Willamette Data Cache Design
During pipe stage 17, the base and offset values used to generate the effective address are also passed to a sum-indexed way predictor and a sum-addressed memory array decoder. The latter circuit selects a single word line to assert simultaneously in the four 2 KB memory arrays that comprise the 4-way, 8 KB data cache during pipe stage 18. One of the four accessed 128 bit values is selected by the way predictor through the 4:1 way MUX, and the selected value is available in good time for transport to the execution units.
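The decoder portion of this hypothetical design can be modeled as follows. The parameters are assumptions for illustration only: 32-byte lines give 64 sets per 2 KB way, and the carry into the index field from the low line-offset bits is ignored for simplicity.

```python
ROWS = 64  # assumed: 2 KB way / 32-byte lines = 64 sets

def sums_to(a: int, b: int, k: int, width: int = 6) -> bool:
    """Carry-free test of (a + b) mod 2**width == k."""
    m = (1 << width) - 1
    x = (a ^ b ^ k) & m                        # required carry-in per bit
    y = (((a & b) | ((a | b) & ~k)) << 1) & m  # generated carry per bit
    return x == y

def decode_word_line(base_idx: int, off_idx: int) -> int:
    """Model of the sum-addressed row decoder.

    In hardware every row tests in parallel, with no shared carry
    chain, whether base+offset selects it; exactly one word line
    asserts. This loop models those parallel per-row comparators.
    """
    matches = [k for k in range(ROWS) if sums_to(base_idx, off_idx, k)]
    assert len(matches) == 1  # exactly one word line fires
    return matches[0]
```

A real implementation would then read all four ways on the asserted word line and let the way predictor steer the 4:1 MUX, with the tag compare confirming (or overriding) the choice a cycle or two later.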