Performance Analysis for Core 2 and K8: Part 1


Overall Performance

Performance is a function of three variables: the path length of the application (the number of instructions executed), frequency, and cycles per instruction (CPI). The last of these is the best metric for overall microarchitecture performance here, since the two CPU frequencies are within 5% of each other and the applications have a fixed path length (they have already been compiled). Note that while the path length is fixed, there may be a different code path for each CPU, so do not assume that the path length is the same for the K8 and Core 2.
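The relationship between these three variables can be sketched in a few lines of Python. This is only an illustration: the function name and every number below are hypothetical and are not measurements from this article.

```python
def execution_time(path_length, cpi, frequency_hz):
    """Run time in seconds: instructions * (cycles/instruction) / (cycles/second)."""
    return path_length * cpi / frequency_hz

# Hypothetical values, chosen only to show how CPI drives run time
# when path length and frequency are held equal.
path_length = 1_000_000_000  # instructions retired (assumed)
time_a = execution_time(path_length, cpi=1.2, frequency_hz=2.4e9)
time_b = execution_time(path_length, cpi=1.5, frequency_hz=2.4e9)

# With the other two variables equal, the run-time ratio equals the CPI ratio.
speedup = time_b / time_a
print(f"A: {time_a:.3f} s, B: {time_b:.3f} s, A is {speedup:.2f}x faster")
```

Because path length and frequency cancel out when they are equal, comparing CPI alone is enough to rank the two hypothetical chips in this sketch.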

In many ways, it is more natural to think about instructions per cycle (IPC), which is the inverse of CPI, since most CPUs are superscalar. In theory, the Core 2 and K8 can retire 4 and 3 instructions/cycle respectively. In reality though, neither can really get much better than one instruction per cycle (i.e. the CPI is always above 1).
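To make the gap between theory and practice concrete, here is a small sketch that converts between CPI and IPC and computes the best-case CPI implied by each retire width. The retire widths come from the text above; the helper names and the example CPI of 1.0 are assumptions for illustration.

```python
def cpi_to_ipc(cpi):
    """IPC is simply the reciprocal of CPI."""
    return 1.0 / cpi

# Peak retire widths (instructions/cycle) as stated in the text.
PEAK_RETIRE_WIDTH = {"Core 2": 4, "K8": 3}

# The lowest CPI each design could reach if it retired at full width every cycle.
best_case_cpi = {cpu: 1.0 / width for cpu, width in PEAK_RETIRE_WIDTH.items()}

def retire_utilization(cpi, width):
    """Fraction of the theoretical retire bandwidth actually used."""
    return cpi_to_ipc(cpi) / width

for cpu, width in PEAK_RETIRE_WIDTH.items():
    # An assumed CPI of 1.0, i.e. roughly the practical limit described above.
    used = retire_utilization(1.0, width)
    print(f"{cpu}: best-case CPI {best_case_cpi[cpu]:.2f}, "
          f"utilization at CPI 1.0 = {used:.0%}")
```

Even at a CPI of 1.0, a 4-wide machine is using only a quarter of its theoretical retire bandwidth, and a 3-wide machine only a third.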

Figure 3 – Cycles per Instruction Retired

A few comments on this data are in order. First of all, the general trend is that the Core 2 has a lower CPI than the K8 – indicating higher performance, which is consistent with actual benchmark results. There are some exceptions though; in FEAR, at low visual quality, the Core 2 beats out the K8 by a pretty substantial margin (~20%), but this is inexplicably reversed at high visual quality. A similar, but far less pronounced reversal occurs for Prey.

While we discarded the results from X3, it is worth mentioning that this is one of the metrics that caught our attention. In both cases, the K8 outperforms the Core 2 on X3. This contradicts both common sense and performance results, and was one of the first clues that AMD’s performance analysis tools might not be correctly reporting data.

Both the K8 and Core 2 decode x86 instructions into smaller, RISC-like uops that are natively supported by hardware. Intel and AMD refer to these native instructions as uops, while they refer to actual x86 instructions as macro-ops. In general, AMD’s uops are more powerful than Intel’s uops – meaning that for a given piece of x86 code, it will require more uops to execute on an Intel CPU than on an AMD CPU.

Figure 4 – uops per Instruction Retired

Figure 4 above shows the native uops per x86 instruction retired. For the most part, the data is consistent with our expectations; the K8 uses fewer uops than the Core 2. The Core 2 averages 1.41 uops/instruction, while the K8 typically uses around 1.28.
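As a quick check on the size of that gap, the sketch below compares the two averages quoted above. The constant and function names are illustrative, and the result reflects only those two averages, not the per-benchmark data in Figure 4.

```python
# Average uops per x86 instruction retired, as quoted in the text.
CORE2_UOPS_PER_INSTR = 1.41
K8_UOPS_PER_INSTR = 1.28

def relative_excess(a, b):
    """How much larger a is than b, as a fraction of b."""
    return a / b - 1.0

extra = relative_excess(CORE2_UOPS_PER_INSTR, K8_UOPS_PER_INSTR)
print(f"Core 2 uses about {extra:.1%} more uops per instruction than the K8")
```

On these averages, the Core 2 issues roughly a tenth more uops than the K8 for the same number of x86 instructions.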

The change in the number of uops/instruction for the Core 2 running Prey is surprising. At low resolution, the uops/instruction is right at the average, around 1.4, but at high resolution it jumps by roughly 30% to around 1.8, which is unusually high. Intel’s uop format generally only accommodates 2 inputs and 1 output, so this suggests that at high resolution Prey favors more complex x86 instructions that likely either use 3 inputs or have 2 outputs.
