Performance Analysis for Core 2 and K8: Part 1

Data Cache

The next important portions of the two cores that can be compared are the load/store pipelines and the data caches. Before turning to the data caches themselves, it is worth looking at the impact of misaligned memory accesses. This is particularly interesting because in the next generation of microprocessors (Nehalem and Barcelona), both Intel and AMD have no performance penalty for unaligned loads and stores, thanks to various microarchitectural tweaks.


Figure 11 – Misaligned Accesses per Instruction Retired

A misaligned data access occurs when an N-byte piece of data is stored starting at an address that is not a multiple of N (address 0 counts as aligned). For instance, loading an 8 byte (64 bit) piece of data starting at byte 3 in memory would be considered a misaligned load. The original x86 architecture freely allowed misaligned accesses, but most SSE2 instructions that take a 128 bit memory operand require that the data be aligned on a 16 byte boundary; only the explicitly unaligned move instructions are exempt.
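The definition above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name `is_misaligned` is ours and does not correspond to any performance counter or tool used for the measurements.

```python
def is_misaligned(address: int, size: int) -> bool:
    """Return True if a size-byte access starting at address is misaligned.

    Follows the definition in the text: an N-byte access is aligned only
    when its starting address is a multiple of N.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return address % size != 0


# The example from the text: an 8 byte load starting at byte 3.
print(is_misaligned(3, 8))    # True
# An 8 byte load starting at byte 16 is aligned.
print(is_misaligned(16, 8))   # False
# A 16 byte operand at byte 24 is 8 byte aligned but not 16 byte aligned.
print(is_misaligned(24, 16))  # True
```

The last case shows why 128 bit operands are the usual source of trouble: data that is perfectly aligned for 64 bit accesses can still violate a 16 byte requirement.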

Generally, misaligned accesses are fairly infrequent, although the K8 measurements for Far Cry seem to be exceptionally high and are possibly incorrect. Games are generally well behaved, whereas code that deals with networking, storage or motion compensation is much worse, because its data types or algorithms tend to pack data as tightly as possible rather than spacing data out so that it is properly aligned. Unfortunately, we do not have any results from such software to compare with these games.

Figure 12 below shows the number of cache accesses per instruction retired. It is important to keep in mind that x86 instructions are not strictly register to register; an x86 instruction can source an operand from memory, which implicitly produces a load or store. The data suggests that slightly over half of instructions either use a memory address to source an operand or are loads or stores, which seems quite reasonable.


Figure 12 – L1D Accesses per Instruction Retired

As an aside, the K8 measurements for Far Cry High are incorrect and should be disregarded. Unfortunately, this makes it impossible to correctly calculate the L1D hit rate for Far Cry High as well. There also appear to be some irregularities in Far Cry Low when running on the K8 – it is very hard to believe that 75% of instructions would require a memory access.

Intel and AMD each use the same design for their L1D cache as for their L1I cache, so the differences between the two vendors are the same as previously mentioned. Intel’s L1D is 32KB and 8 way associative, while AMD’s L1D is 64KB and 2 way associative.
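To make the trade-off between size and associativity concrete, the sketch below computes how many sets each L1D has. It assumes 64 byte cache lines, which the text above does not state, and it uses the textbook relation sets = size / (ways × line size); the helper name `num_sets` is ours.

```python
LINE_BYTES = 64  # assumption: cache line size, not stated in the article


def num_sets(size_bytes: int, ways: int, line_bytes: int = LINE_BYTES) -> int:
    """Number of sets in a set-associative cache of the given geometry."""
    return size_bytes // (ways * line_bytes)


# Sizes and associativity as given in the text.
intel_sets = num_sets(32 * 1024, ways=8)
amd_sets = num_sets(64 * 1024, ways=2)

print(intel_sets)  # 64 sets, each holding 8 lines
print(amd_sets)    # 512 sets, each holding 2 lines
```

Under that assumption, Intel's cache has few sets with many ways each, so it tolerates several hot lines mapping to the same set, while AMD's cache has many sets but can hold only two conflicting lines per set.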


Figure 13 – L1D Misses per Instruction Retired

AMD’s L1D cache tends to outperform Intel’s by around 2 MPKI due to the size advantage, with the exception of Prey. Prey seems to prefer more associative caches, which suggests a smaller working set that is slightly more spread out, perhaps reflecting more complicated code and data structures.

A prominent computer architecture rule of thumb is that doubling the size of a cache reduces the miss rate by a factor of the square root of two. The data above seems to confirm this rule of thumb: the ratio of Intel/AMD MPKIs is about 1.3-1.5, except for Prey.
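The rule of thumb is easy to check numerically. The sketch below is a simple illustration, with a function name of our own choosing; it computes the miss ratio the rule predicts for a 32KB cache versus a 64KB cache and compares it with the 1.3-1.5 range quoted above. It deliberately ignores associativity, which the rule does not model.

```python
import math


def expected_miss_ratio(size_ratio: float) -> float:
    """Predicted (small cache misses) / (large cache misses).

    Under the square-root rule, miss rate scales as 1 / sqrt(cache size),
    so a cache that is size_ratio times larger misses sqrt(size_ratio)
    times less often.
    """
    return math.sqrt(size_ratio)


ratio = expected_miss_ratio(64 / 32)  # AMD L1D size / Intel L1D size
print(round(ratio, 2))                # 1.41
print(1.3 <= ratio <= 1.5)            # True
```

The predicted ratio of roughly 1.41 sits in the middle of the observed range, which is why the measurements look like a confirmation of the rule.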


Figure 14 – L1D Hit Rate

The last figure concerns the hit rate for the L1D caches. The result for the K8 under Far Cry High is incorrect, again stemming from incorrect measurements of the number of cache accesses, which artificially deflates the hit rate. The observed hit rates are surprisingly high – almost all are 98% or greater, and all are above 97%.
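The hit rate follows directly from the two preceding metrics: it is one minus the ratio of misses to accesses. The sketch below uses made-up round numbers, not values read from the figures, to show the calculation and to show how an undercounted number of accesses deflates the result. The function name is ours.

```python
def l1d_hit_rate(mpki: float, accesses_per_instruction: float) -> float:
    """Hit rate = 1 - (misses per instruction / accesses per instruction)."""
    if accesses_per_instruction <= 0:
        raise ValueError("accesses_per_instruction must be positive")
    misses_per_instruction = mpki / 1000.0
    return 1.0 - misses_per_instruction / accesses_per_instruction


# Illustrative inputs only: 10 MPKI and 0.5 accesses per instruction.
print(f"{l1d_hit_rate(10, 0.5):.1%}")   # 98.0%

# Same misses, but accesses undercounted by half: the hit rate drops.
print(f"{l1d_hit_rate(10, 0.25):.1%}")  # 96.0%
```

Because the access count is in the denominator, any counter error that loses accesses makes the cache look worse than it is, even when the miss count is accurate.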

Of course, the L1D cache is only one part of the cache hierarchy; to see the full story we must look at the L2 cache.
