Performance Analysis for Core 2 and K8: Part 1


Instruction Fetch

The next part of the pipeline is the instruction fetch and L1I cache. Unfortunately, the results get a bit confusing here. In theory, the Core 2 and K8 should be fetching at roughly the same rate, give or take 20% to account for differences in IPC. However, the measured data does not bear this out at all.

One possible explanation is that the two different programs (VTune and Code Analyst) are actually measuring different things. Another possible explanation is errata in the performance counters – since these are non-architectural features, any errors often go unfixed due to the risk that a fix might break something that is architecturally visible.


Figure 8 – L1I Cache Accesses per Instruction Retired

First of all, the K8 L1I accesses/instruction retired is simply wrong for Far Cry at high resolution – there is no way that the front-end is fetching only once every ~140 instructions (the measured result was 0.007 fetches/instruction retired). The densest possible instruction encoding would be 1 fetch every 16 instructions, or 0.0625 fetches/instruction (since the K8 fetches 16B/cycle and x86 instructions can be as short as 1 byte), and even that is impossible in a real application.
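The bound above can be sketched as a quick sanity check. This is purely illustrative arithmetic; the constants encode the fetch width and minimum x86 instruction length from the text, and the 0.007 figure is the reported measurement.

```python
# Sanity check on the K8 fetch-rate figure from the text.
FETCH_WIDTH_BYTES = 16   # K8 fetches 16 bytes per cycle
MIN_INSN_BYTES = 1       # shortest possible x86 instruction

# Densest possible code: every instruction is 1 byte, so one 16B fetch
# covers 16 instructions -> lower bound on fetches per instruction.
min_fetches_per_insn = MIN_INSN_BYTES / FETCH_WIDTH_BYTES  # 0.0625

measured = 0.007  # reported K8 result for Far Cry at high resolution
print(measured < min_fetches_per_insn)  # True -> the measurement is impossible
```

Since real code averages several bytes per instruction, actual fetch rates sit well above this floor, making the 0.007 figure even less plausible.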

Ignoring that issue, it is quite surprising that the Core 2 is accessing the L1I cache roughly 3X as often as the K8. If anything, the Core 2 should access the L1I cache less than the K8 due to the loop cache in the front-end. Frankly, we are rather puzzled by these results. We have spoken with several architects at Intel and confirmed that the Core 2 measurements are correct. Similar discussions with AMD architects were not quite as conclusive, but seemed to indicate that the K8 measurements are correct as well. More than anything else, this appears to be a lesson on the hazards of using performance analysis tools and the associated documentation.


Figure 9 – L1I Cache Misses per Instruction Retired (MPKI)

The traditional way to analyze the performance of a cache is very similar to that of branch predictors – the metric most engineers use is Misses Per Kilo-Instruction (MPKI), where lower is better. Intel and AMD take very different approaches to their L1 caches. Intel tends to use smaller, more highly associative caches; in this case the Core 2 is equipped with a 32KB, 8-way associative L1I cache. In contrast, the K8 uses larger but less associative caches – 64KB and 2-way associative.
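For concreteness, the MPKI metric is just a normalized miss count. A minimal sketch, with made-up placeholder counter values rather than the article's measurements:

```python
# Minimal sketch of the MPKI metric; the counter values in the example
# call are illustrative placeholders, not data from the article.
def mpki(cache_misses, instructions_retired):
    """Misses per kilo- (thousand) instructions retired; lower is better."""
    return cache_misses * 1000 / instructions_retired

print(mpki(2_500, 1_000_000))  # -> 2.5 MPKI
```

Normalizing by retired instructions rather than cycles makes the metric comparable across processors with different clock speeds and IPC.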

For these benchmarks, higher associativity seems to be more important than pure capacity. Intel’s caches tend to outperform AMD’s, often markedly so, with a few benchmarks that seem indifferent to cache organization.


Figure 10 – L1I Cache Hit Rate

Figure 10 shows the hit rate for the L1I caches. However, the results for the K8 on Far Cry High are distorted by the errors in the K8 instruction fetch counting: the measured hit rate is 54%, which is absolutely wrong. We can firmly guarantee that the K8’s L1I cache has a hit rate >90% for all of these benchmarks.

What is most surprising about the data is that Intel’s L1I cache hit rate is over 99% for all of the workloads, indicating complete coverage of the working set in a rather small (32KB) cache. We suspect that the hit rate for the K8 is comparable; however, the errors in measuring the number of instruction fetches for the K8 distort the hit rate for the L1I cache, making this data unreliable.
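A short sketch of the hit-rate calculation shows how an undercounted access total can produce the bogus K8 figure. The numbers below are illustrative assumptions, not the article's measured data.

```python
# Hit rate = 1 - misses / accesses; all values here are illustrative.
def hit_rate(accesses, misses):
    return 1 - misses / accesses

# With a correctly counted access total, a small miss count gives a
# plausible hit rate in the >99% range the article reports for Core 2.
print(hit_rate(1_000_000, 5_000))  # -> 0.995

# If the access counter undercounts (as suspected for the K8), the same
# number of misses yields an absurdly low apparent hit rate.
print(hit_rate(11_000, 5_000))     # -> ~0.545, like the bogus 54% figure
```

The miss counter can be accurate while the hit rate is still wrong, because the rate depends on both counters; this is why the MPKI data in Figure 9 remains usable even though Figure 10's K8 numbers are not.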

