This article has provided an initial look into the nature of performance in two relatively modern CPUs: a 2.93GHz 65nm Core 2 Duo and a 2.8GHz 90nm K8. Some of the more salient results from the last several pages are summarized below:
- The Core 2’s IPC is about 5-10% higher than the K8 for our set of games.
- The K8 has 20% fewer uops per instruction than the Core 2 for our set of games.
- The Core 2’s branch predictors are vastly more accurate, with about 50% fewer mispredicted branches per instruction retired for our set of games.
- The Core 2’s instruction cache is slightly more effective, with ~20% fewer misses per instruction retired for our set of games.
- Misaligned memory accesses are very infrequent on both CPUs – approximately one out of every thousand retired instructions is a misaligned access for our set of games.
- 60% of x86 instructions access memory for our set of games.
- The K8’s L1D cache is more effective, with about 20% fewer misses per instruction retired for our set of games.
- The Core 2 Duo’s L2 cache is much more effective, with about 50% fewer misses per instruction retired for our set of games.
Along the way, we learned several other lessons. First of all, performance analysis tools can be very tricky – when we tried to measure instruction cache accesses, we got inconsistent results between different tools, which suggests the two tools were measuring slightly different events. On top of that, we also got inconsistent results between different runs with the same tool – which suggests that the tool in question is simply flaky.
This brings us back to the original point of the article – explaining the disparity in performance between the two contenders for mainstream client systems. The first thing that leaps out is the huge difference in branch prediction accuracy. The K8 mispredicts roughly twice as often – and each mispredict probably ends up squashing around 50-72 instructions (depending on the occupancy of the re-order buffer). So for every 1000 instructions retired, the K8 ends up squashing around 450-650 instructions due to branch mispredicts (9 MPKI). In contrast, the Core 2 is much more efficient, squashing between 280-385 instructions for every thousand (the re-order window is probably between 70-96 entries, with 4 MPKI). The impact on performance is huge, since each time the pipeline is cleared, somewhere around 50-100 cycles worth of work is wasted – roughly 20-35ns per mispredict at these clock speeds. The energy costs are just as substantial – with the CPU drawing around 60W while active, every extra second of wasted work costs about 60 joules.
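The mispredict arithmetic above can be sketched in a few lines. This is a back-of-the-envelope model; the MPKI figures, re-order window sizes, and clock speeds are this article's estimates, not architectural constants:

```python
# Back-of-envelope cost of branch mispredicts per 1000 retired instructions.
# Inputs are this article's estimates: 9 MPKI for the K8, 4 MPKI for the Core 2.

def squashed_per_kilo(mpki, window_low, window_high):
    """Instructions squashed per 1000 retired, for a range of ROB occupancies."""
    return mpki * window_low, mpki * window_high

def wasted_ns(cycles, ghz):
    """Wall-clock time lost for a given number of wasted pipeline cycles."""
    return cycles / ghz

print(squashed_per_kilo(9, 50, 72))    # K8:     (450, 648)
print(squashed_per_kilo(4, 70, 96))    # Core 2: (280, 384)

# ~50-100 cycles of wasted work per mispredict at ~2.8-2.93GHz:
print(round(wasted_ns(50, 2.8), 1))    # ~17.9 ns
print(round(wasted_ns(100, 2.93), 1))  # ~34.1 ns
```

Even with the more accurate predictor, the Core 2 still throws away roughly a quarter to a third of its speculative work per thousand instructions on these workloads.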
The other likely performance culprit is accesses which miss in the L2 cache. While L2 misses are rare – around 2 MPKI for the Core 2 and 4 MPKI for the K8 – the latency is huge, between 120-200 cycles (or higher if there is lots of contention between pending requests). Unlike a branch mispredict, this latency can be hidden to some extent: multiple cache misses can often be serviced in parallel, instructions that are independent of the miss can continue executing, or a miss can even overlap a branch mispredict – so that by the time the CPU resumes execution, the data has already arrived. Even assuming optimistically that half the memory access latency can be hidden, that still leaves 60-100 cycles of stalls per cache miss.
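That stall estimate can be expressed as a simple model. The MPKI and latency figures are the estimates above, and the 50% overlap factor is the same optimistic assumption, not a measured quantity:

```python
# Rough exposed L2-miss stall cycles per 1000 retired instructions,
# using this article's figures: 2 MPKI (Core 2) vs. 4 MPKI (K8),
# 120-200 cycle memory latency, with ~half the latency hidden by overlap.

def stall_cycles_per_kilo(mpki, latency_cycles, hidden_fraction=0.5):
    """Stall cycles per 1000 instructions that remain exposed after overlap."""
    return mpki * latency_cycles * (1 - hidden_fraction)

print(stall_cycles_per_kilo(2, 120))  # Core 2, best case:  120.0
print(stall_cycles_per_kilo(2, 200))  # Core 2, worst case: 200.0
print(stall_cycles_per_kilo(4, 120))  # K8, best case:      240.0
print(stall_cycles_per_kilo(4, 200))  # K8, worst case:     400.0
```

Even under the optimistic overlap assumption, the K8's higher miss rate roughly doubles its exposed memory stall time relative to the Core 2.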
One of the interesting factors is the substantial difference in miss rates between the two cache designs, which is influenced by the underlying memory subsystems. Intel’s unloaded memory latency is around 55-60ns, while AMD’s is closer to 40ns and should also scale much better under load. Unfortunately, there is no data available on the loaded latency for the respective CPUs, but a reasonable guess would be that Intel’s loaded latency is 40-70% higher. Given that guess, we can roughly estimate the average latency contribution from L2 misses. Intel has half the number of misses (2 vs. 4) per thousand instructions retired, but 40% higher latency. That implies that Intel’s average memory latency contribution from L2 misses is about 70% of AMD’s (or 85% if we assume Intel’s loaded latency is 70% higher). Of course, this is only looking at one aspect of the situation – it ignores the impact of the L1 caches, where AMD tends to have an advantage due to larger capacity. But it’s certainly an area that could contribute to the performance difference between the K8 and the Core 2 and definitely does contribute to the power differences.
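The comparison works out as follows. This is a sketch of the estimate above, and the 40-70% loaded-latency premium for Intel is this article's guess, not a measured value:

```python
# Relative average memory-latency contribution from L2 misses, Intel vs. AMD.
# Assumptions from the article: Intel has 2 MPKI to AMD's 4, and Intel's
# loaded memory latency is guessed to be 40-70% higher than AMD's.

def intel_vs_amd(intel_mpki, amd_mpki, intel_latency_premium):
    """Ratio of Intel's total L2-miss latency contribution to AMD's,
    where the premium is Intel's extra latency as a fraction of AMD's."""
    return (intel_mpki * (1 + intel_latency_premium)) / amd_mpki

print(round(intel_vs_amd(2, 4, 0.4), 2))  # 0.7  -> ~70% of AMD's contribution
print(round(intel_vs_amd(2, 4, 0.7), 2))  # 0.85 -> ~85% of AMD's contribution
```

In other words, Intel's lower miss rate more than compensates for its slower memory subsystem under these assumptions, though the margin narrows as the latency gap widens.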
Hopefully the first part of this article has helped to shed some light on the performance differences between the K8 and Core 2. Future work will focus more on exploring aspects of each individual microarchitecture, as made visible through performance analysis tools. This will give us an opportunity to explore the efficacy of features like Intel’s micro-op fusion, memory disambiguation and other techniques used in modern microprocessors.
This article required a great deal of effort from the two authors. Both Aaron and David Kanter spent countless hours gathering the benchmark results and performance counter data, poring over them, discovering inconsistencies, and then tracking those inconsistencies down and correcting them. We were graciously assisted by staff at AMD, Intel and other companies, including those who contributed hardware to this experiment. They all deserve credit alongside the authors, especially the folks who helped us work with Code Analyst and VTune. We’d like to thank Intel, AMD, NVIDIA, Crucial, Western Digital, OCZ and Microsoft for aiding our work.