A Tale of Two Caches
The AMD K7 L1 data cache is a 2-way set associative, 64 KB design that implements pseudo dual-port access using a multi-bank architecture. The basic design is shown in Figure 2. A data memory access starts in pipe stage 8, when an effective address is generated by one of the three address generation units (AGUs). This typically involves adding a displacement to the contents of a base register, such as the stack pointer ESP. The effective address is sent to the data cache through two levels of multiplexors (MUXes), which takes most of pipe stage 9. The address that reaches the data cache is still a logical address; that is, it hasn’t yet been translated to a physical address.
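As a rough illustration of the address generation step described above, the sketch below adds a signed displacement to a base register value. It assumes 32-bit addressing with no segment base or index register involved; the function name `effective_address` is hypothetical and is not taken from any AMD documentation.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit address arithmetic


def effective_address(base, displacement):
    """Add a signed displacement to a base register value.

    The sum is wrapped to 32 bits. The result is still a logical
    address; no translation has taken place at this point.
    """
    return (base + displacement) & MASK32


# Example: a local variable 8 bytes below a stack pointer value.
esp = 0x0012FF80
local_var = effective_address(esp, -8)
```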
The K7 data cache system performs three operations in parallel, starting late in pipe stage 9 and stretching over most of pipe stage 10. Two of them use the low bits of the logical address to initiate reads of the cache data arrays and the cache tag arrays. The third is a content addressable memory (CAM) based search of the data translation lookaside buffer (TLB) for a valid logical-to-physical address mapping for the current effective address.
Figure 2. AMD K7/Athlon L1 Data Cache Design
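The parallel lookup described above can be modeled in a few lines of software. This is a behavioral sketch only, not a description of the actual circuit: the 64-byte line size is an assumption (the text does not state it), the TLB is modeled as a plain dictionary, and the tag is simplified to the full physical line number. All names here (`set_index`, `lookup`, `tlb`, `tags`, `data`) are hypothetical.

```python
CACHE_BYTES = 64 * 1024   # from the article
WAYS = 2                  # from the article
PAGE_BYTES = 4 * 1024     # x86 page size, from the article
LINE_BYTES = 64           # assumption: line size is not given in the text

WAY_BYTES = CACHE_BYTES // WAYS
NUM_SETS = WAY_BYTES // LINE_BYTES


def set_index(logical_addr):
    """Low logical address bits that select a set in both arrays."""
    return (logical_addr // LINE_BYTES) % NUM_SETS


def lookup(logical_addr, tlb, tags, data):
    """Model one load: returns (status, line_payload_or_None).

    tlb  : dict mapping virtual page number -> physical page number
    tags : tags[set][way] holds a physical line number or None
    data : data[set][way] holds the cached line payload
    """
    idx = set_index(logical_addr)

    # In hardware these three steps proceed at the same time.
    tag_pair = tags[idx]                         # tag array read
    data_pair = data[idx]                        # data array read
    ppn = tlb.get(logical_addr // PAGE_BYTES)    # TLB search

    if ppn is None:
        return "tlb_miss", None

    phys_addr = ppn * PAGE_BYTES + logical_addr % PAGE_BYTES
    phys_tag = phys_addr // LINE_BYTES           # simplified tag

    # Compare the translated address against the tag of each way.
    for way in range(WAYS):
        if tag_pair[way] == phys_tag:
            return "hit", data_pair[way]         # way MUX selects this line
    return "miss", None
```

With these assumed parameters each way is 32 KB and holds 512 sets, so the set index comes from logical address bits 6 through 14.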
Assuming a TLB hit, the physical address from the TLB is compared against the two tag values read from the two ways of the data cache. If the data is present in either way, the physical address will match one tag, and the data from that way will be steered through the way MUX to the cache output. Because the way size (32 KB) is larger than the x86 page size (4 KB), some of the address bits used to index the tag and data arrays are also translated by the TLB. This means that the K7 data cache must also handle a situation known as virtual index aliasing. When that happens, the physical address matches both tag values. The aliased cache lines must be invalidated (and written back if dirty), and the desired cache line is then brought back into the cache. When everything works without a hitch (the vast majority of the time), the data is available to the processor core at the end of pipe stage 10. As shown in Figure 2, this results in a load-use latency of 3 clock cycles, because a dependent instruction must wait for the data to become available at the end of the load operation’s pipe stage 10 before it can feed into the beginning of its own pipe stage 8.
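The arithmetic behind the aliasing problem and the latency figure can be worked through directly. The sketch below uses the way and page sizes from the text; the 64-byte line size is an assumption, and the function names are hypothetical. It shows only how many index bits lie above the page offset and which sets a given physical line could land in under different virtual mappings. It does not model how the K7 detects or resolves aliases.

```python
WAY_BYTES = 32 * 1024    # from the article
PAGE_BYTES = 4 * 1024    # from the article
LINE_BYTES = 64          # assumption: line size is not given in the text


def translated_index_bits():
    """Number of set-index bits that lie above the page offset."""
    way_bits = WAY_BYTES.bit_length() - 1     # log2 of the way size
    page_bits = PAGE_BYTES.bit_length() - 1   # log2 of the page size
    return way_bits - page_bits


def alias_sets(page_offset):
    """Set indexes where a line at this page offset could be cached,

    one for each possible value of the translated index bits.
    """
    low = (page_offset % PAGE_BYTES) // LINE_BYTES
    sets_per_page = PAGE_BYTES // LINE_BYTES
    return [low + k * sets_per_page for k in range(WAY_BYTES // PAGE_BYTES)]


def load_use_latency(address_stage=8, data_ready_stage=10):
    """Cycles from address generation through the end of the data stage."""
    return data_ready_stage - address_stage + 1
```

Under these assumptions, three index bits (logical address bits 12 through 14) are subject to translation, so a single physical line can map to any of eight sets depending on the virtual address used to reach it. The latency helper simply counts stages 8, 9, and 10, which gives the 3 cycle figure quoted in the text.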