As the data on the previous page showed, not all loads and stores hit in the L1D cache. Frankly, that’s not a tremendous problem; a modern out-of-order processor can pretty easily hide the latency of an L1D miss, which is likely to be 7-15 cycles. The real problem is when a load or store misses in the last level cache (LLC), which for this crop of processors is the L2. There is no practical way to hide the latency of a memory access, which takes around 200 cycles or longer.
The Core 2 Duo uses an inclusive and shared 4MB L2 cache with 16-way associativity and 14-cycle latency. The K8 sports two exclusive and private 1MB caches, also with 16-way associativity, and 12-cycle latency.
Figure 15 – L2 Accesses per Instruction Retired
To start, we look at how often the L2 cache is accessed, which is a function of the L1D and L1I cache miss rates (the L2 is only accessed on misses in an L1 cache). In general, Intel’s L2 cache is accessed about 2-3x more often than AMD’s, because of the relative sizes of the L1 caches. AMD’s larger L1 caches provide a higher hit rate and therefore significantly reduce the pressure on the L2 (although the two were about even for Prey).
Figure 16 – L2 Misses per Instruction Retired
Unlike L1D misses, which can be hidden by out-of-order execution, every miss in the L2 cache results in a 200-300 cycle memory access, which is quite significant from a performance (and power) perspective. The L2 MPKI (misses per thousand instructions) results generally show the advantage of Intel’s substantially larger 4MB cache.
Some of the results are mildly puzzling though; there is no obvious reason for the K8’s 1MB cache to outperform Intel’s far larger cache in Far Cry. It is possible that the working set is so large that it frequently misses even in a 4MB L2 cache, but in that case the miss rate for a smaller cache should be even worse. The most plausible explanation is a measurement error with the K8 performance counters.
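To put the cost of L2 misses in perspective, here is a rough sketch of how MPKI translates into stall cycles. The counter readings are invented for illustration and are not taken from the figures; the function names are hypothetical. The estimate assumes every L2 miss is fully exposed (no overlapping misses, no prefetching), so it should be read as a pessimistic bound rather than a measurement.

```python
def mpki(misses: int, instructions: int) -> float:
    """Misses per thousand instructions retired."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return 1000.0 * misses / instructions


def worst_case_stall_cpi(l2_mpki: float, miss_penalty_cycles: float) -> float:
    """Stall cycles per instruction if every L2 miss were fully exposed.

    Ignores overlapping misses and prefetching, so this is a pessimistic bound.
    """
    return l2_mpki * miss_penalty_cycles / 1000.0


if __name__ == "__main__":
    # Hypothetical counter readings, NOT values from the figures.
    l2_misses = 2_000_000
    instructions_retired = 1_000_000_000

    rate = mpki(l2_misses, instructions_retired)
    for penalty in (200, 300):  # memory latency range quoted in the text
        cpi = worst_case_stall_cpi(rate, penalty)
        print(f"MPKI={rate:.1f}, penalty={penalty} cycles: "
              f"up to {cpi:.2f} stall cycles per instruction")
```

Even a low MPKI adds a noticeable fraction of a cycle to every instruction under these assumptions, which is why L2 misses matter far more than L1D misses.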
Figure 17 – L2 Cache Hit Rate
The last bit of data regarding the L2 cache performance is the hit rate. Unlike the MPKI, this adjusts for the fact that Intel’s L2 cache is more heavily utilized due to the smaller L1 caches.
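The relationship between the two metrics can be sketched as follows. The two processors and their counts are made up purely for illustration (they are not the measured data), and the sketch assumes the access and miss counters cover the same kind of requests. It shows how two caches with identical misses per instruction can have very different hit rates when one is accessed more often.

```python
def l2_hit_rate(accesses: int, misses: int) -> float:
    """Fraction of L2 accesses that hit: 1 - misses / accesses."""
    if accesses <= 0:
        raise ValueError("accesses must be positive")
    if not 0 <= misses <= accesses:
        raise ValueError("misses must be between 0 and accesses")
    return 1.0 - misses / accesses


if __name__ == "__main__":
    # Hypothetical counts per 1000 instructions retired, NOT measured values.
    # Both CPUs have the same MPKI (3), but cpu_a's L2 is accessed 2.5x as often.
    samples = {
        "cpu_a": {"accesses": 30, "misses": 3},
        "cpu_b": {"accesses": 12, "misses": 3},
    }
    for name, counts in samples.items():
        rate = l2_hit_rate(counts["accesses"], counts["misses"])
        print(f"{name}: MPKI={counts['misses']}, hit rate={rate:.0%}")
```

In other words, MPKI measures the absolute cost to the program, while the hit rate measures how well the cache serves the requests it actually receives.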
The L2 hit rates are partly expected and partly surprising. The hit rate clearly shows the expected advantage of Intel’s much larger capacity in a way that the MPKI does not. On average, Intel’s hit rate is around 90%, while AMD’s hit rate is roughly 75%. The magnitude of the difference is unexpected, as is the absolute value for AMD’s hit rate. We suspect that the hit rates for AMD are deflated; for example, the 67-70% hit rates for Far Cry seem too low to be credible, but this is the only data we have at present.