Data Cache: Does Size Matter More Than Speed?
Even if we accept the proposition that data memory latency is crucially important for good x86 performance, we are still left with the question of whether a 2.2x higher miss rate is a price worth paying to eliminate that 3rd cycle of data cache latency. The stakes are high because an access that misses the data cache causes the processor to next search the level 2 (L2) cache, which can take 5, 6, or more clock cycles. The worst case scenario is that the requested data isn’t in the L2 either. In this case our revved up processor running well over 1 GHz has to send a memory read command over its system bus to the chipset, which in turn performs a read cycle to main memory. While DRAM data transfer rates have increased dramatically over the last 5 years with the advent of SDRAM, Direct Rambus, and DDR memory technologies (although more slowly than processor clock rates), the sad fact is that memory latency is essentially unchanged (or, in the case of Rambus, perhaps even a bit higher). A memory read cycle might not return the requested data to the processor for 100 ns or more on average. For a 1.5 GHz Willamette processor that represents 150 or more processor cycles.
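To put that in perspective, here is a quick back-of-the-envelope calculation, written as a small Python sketch. The 100 ns figure and the clock rates are the same assumptions used in the paragraph above, not measurements:

```python
# Rough illustration: a fixed ~100 ns DRAM read latency expressed in CPU cycles.
# The 100 ns latency and the clock rates are assumptions taken from the text.
DRAM_LATENCY_NS = 100.0

for clock_mhz in (600, 1000, 1500):
    cycles = DRAM_LATENCY_NS * clock_mhz / 1000.0  # ns * (cycles per ns)
    print(f"{clock_mhz} MHz: a {DRAM_LATENCY_NS:.0f} ns DRAM read costs ~{cycles:.0f} cycles")
```

The same 100 ns stall costs 60 cycles at 600 MHz but 150 cycles at 1.5 GHz, which is why misses all the way to DRAM become steadily more painful as clock rates climb.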
But the important thing to remember is that the Willamette uses an inclusive cache design; that is, every piece of data in the data cache also has a duplicate copy in the L2. As a result, a larger data cache does nothing to reduce the miss rate of the L2 cache. The fraction of CPU data requests that have to go off-chip to DRAM is set by the L2 size and organization, and would be the same whether the data cache were 4 KB or 64 KB.
In Table 1, I have put together some estimates of the cache and memory system performance of a number of x86 processors, including speculative numbers for Willamette and for Mustang, AMD’s next version of the K7/Athlon design, which apparently includes a 1 MB on-chip L2 cache.
Table 1. Estimated cache and memory system performance.

|                                | Pentium III | Pentium III | Pentium IV (estimated) |
|--------------------------------|-------------|-------------|------------------------|
| Clock Rate (MHz)               | 600         | 1000        | 1500                   |
| L1 Dcache Size (KB)            | 16          | 16          | 8                      |
| L1 Dcache Associativity        | 4 way       | 4 way       | 4 way                  |
| L2 Cache Location              | Off Chip    | Integrated  | Integrated             |
| L2 Cache Size (KB)             | 512         | 256         | 256                    |
| L2 Cache Associativity         | 4 way       | 8 way       | 8 way                  |
| L1 load-use latency (cycles)   | 3           | 3           | 2                      |
| L2 access time (cycles)        | 18          | 6           | 5                      |
| Bus transfer rate (MHz)        | 100         | 133.3       | 100 x 4                |
| DRAM access time (cycles)      | 72          | 115         | 180                    |
| L1 hit rate (estimated)        | 97.1%       | 97.1%       | 96.1%                  |
| L2 global hit rate (estimated) | 99.2%       | 98.9%       | 98.9%                  |
| Avg data access (cycles)       | 4.11        | 4.44        | 4.18                   |
| Avg data access (ns)           | 6.84        | 4.44        | 2.78                   |

|                                | Athlon   | Athlon     | Athlon (estimated) |
|--------------------------------|----------|------------|--------------------|
| Clock Rate (MHz)               | 700      | 1000       | 1500               |
| L1 Dcache Size (KB)            | 64       | 64         | 64                 |
| L1 Dcache Associativity        | 2 way    | 2 way      | 2 way              |
| L2 Cache Location              | Off Chip | Integrated | Integrated         |
| L2 Cache Size (KB)             | 512      | 256        | 1024               |
| L2 Cache Associativity         | 2 way    | 16 way     | 16 way             |
| L1 load-use latency (cycles)   | 3        | 3          | 3                  |
| L2 access time (cycles)        | 22       | 11         | 9                  |
| Bus transfer rate (MHz)        | 100 x 2  | 100 x 2    | 133.3 x 2          |
| DRAM access time (cycles)      | 84       | 120        | 173                |
| L1 hit rate (estimated)        | 98.2%    | 98.2%      | 98.2%              |
| L2 global hit rate (estimated) | 98.6%    | 99.0%      | 99.5%              |
| Avg data access (cycles)       | 4.57     | 4.40       | 4.03               |
| Avg data access (ns)           | 6.53     | 4.40       | 2.68               |
The cache hit rates mostly come from [1]. However, those figures were derived from studies of the VAX architecture, a CISC design with twice as many GPRs as x86. Because x86 code must use memory more frequently for local variables, one would expect data cache hit rates for x86 to be higher than for VAX. Comparing the same cache organization in [1] and [2], the latter a study of Pentium Pro performance characteristics, suggests that this effect is real and that x86 enjoys about a 35% lower miss rate for integer programs. All of the L1 miss rates in Table 1 were therefore reduced by 35% from the values in [1] to account for this bias between VAX and x86. The L2 miss rates were left unchanged because there was little difference in L2 miss rates between [1] and [2]. This agrees with the common sense idea that the extra memory accesses caused by the disparity in the number of GPRs are effectively absorbed entirely by the L1 cache. The average memory latency in Table 1 was calculated with the assumption that all systems (MPU + chipset) average 100 ns for a DRAM access, plus an extra overhead of 2 system bus clock cycles.
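For readers who want to check the arithmetic, here is a short Python sketch of that calculation. It is my own reconstruction of the model just described, with the hit rates and latencies taken from Table 1 as given; the simple hit/miss weighting is assumed rather than disclosed. It reproduces the average data access figures in the table to within rounding:

```python
# Reconstruction (assumed model) of the Table 1 latency estimates:
#   avg cycles = L1 latency
#              + L1 miss rate * L2 access time
#              + L2 global miss rate * DRAM access time
# where the DRAM access time is 100 ns of DRAM latency plus 2 system bus
# clocks of overhead, converted into CPU cycles.

def dram_cycles(cpu_mhz, bus_mhz, dram_ns=100.0, bus_clk_overhead=2):
    return dram_ns * cpu_mhz / 1000.0 + bus_clk_overhead * cpu_mhz / bus_mhz

def avg_access(l1_lat, l2_lat, dram_lat, l1_hit, l2_global_hit):
    return l1_lat + (1 - l1_hit) * l2_lat + (1 - l2_global_hit) * dram_lat

# (name, cpu MHz, bus base MHz, L1 lat, L2 lat, L1 hit, L2 global hit) from Table 1.
# The Pentium IV bus is quad-pumped off a 100 MHz base clock, which is the rate
# assumed here for the 2-bus-clock overhead.
systems = [
    ("Pentium III 600",   600, 100.0, 3, 18, 0.971, 0.992),
    ("Pentium III 1000", 1000, 133.3, 3,  6, 0.971, 0.989),
    ("Pentium IV 1500",  1500, 100.0, 2,  5, 0.961, 0.989),
    ("Athlon 700",        700, 100.0, 3, 22, 0.982, 0.986),
    ("Athlon 1000",      1000, 100.0, 3, 11, 0.982, 0.990),
    ("Mustang 1500",     1500, 133.3, 3,  9, 0.982, 0.995),
]

for name, mhz, bus, l1_lat, l2_lat, l1_hit, l2_hit in systems:
    dram = dram_cycles(mhz, bus)
    cyc = avg_access(l1_lat, l2_lat, dram, l1_hit, l2_hit)
    print(f"{name:17s}: DRAM {dram:5.1f} cycles, "
          f"avg data access {cyc:.2f} cycles = {cyc * 1000 / mhz:.2f} ns")
```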
Let’s start by examining the latency difference between the Pentium III Coppermine and the Willamette. According to these estimates, the average data access latency of the Willamette in absolute time is only 62% of that of the Coppermine. This is very close to the information disclosed during the recent IDF that a 1.4 GHz Willamette will have an average data access latency 55% that of a 1 GHz Pentium III. The impressive thing is that the Willamette appears to have a shorter average data access latency than the Coppermine and the T-bird even measured in processor clock cycles, despite being clocked 50% faster. This is significant because code running on the register-poor x86 architecture performs, on average, one data memory access for every two instructions executed. That is about two orders of magnitude more frequent than the rate at which Willamette will mispredict the direction taken by a conditional branch instruction.
I have also included in Table 1 an estimate of the memory performance of a hypothetical AMD K7 Mustang processor equipped with a 1 MB on-chip L2 cache. The Mustang and the Willamette seem to achieve similar average memory latency despite taking two very different routes to get there. Although a bit behind in latency, the Willamette uses an L2 cache ¼ the size of the Mustang’s, a substantial saving in chip area that is made available for other functions. The harder question is what potential impact the 2 cycle data cache has on Intel’s ability to scale up the clock frequency of the Willamette microarchitecture. To answer that, I will look at how the AMD K7 L1 data cache operates and then speculate on how Intel may have gone about achieving a data cache latency of 2 clock cycles, or 1.0 ns at 2 GHz.