A Cache Extravaganza
While almost every aspect of Nehalem has been enhanced, the memory subsystem received the most dramatic overhaul, largely because it is very closely coupled with the overall system architecture. Almost every part of the memory system has been refined, improved or otherwise changed to support greater parallelism, as shown in the figure below.
Figure 4 – Memory Subsystem Comparison
The changes start early on, as Nehalem increases the number of in-flight loads and stores by 50% or more. As shown above, the load buffer now holds 48 entries, up from 32 in the Core 2 (a 50% increase), and the store buffer grew slightly more, from 20 entries to 32 (a 60% increase). One of the reasons to enlarge the load and store buffers is that they must be shared by both threads; in this case they are statically partitioned.
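The buffer arithmetic above can be checked with a short sketch. The per-thread figures assume that static partitioning means an even split between two threads, which is not stated explicitly here; the function and variable names are purely illustrative.

```python
# Load/store buffer sizes quoted in the text (entries).
CORE2 = {"load": 32, "store": 20}
NEHALEM = {"load": 48, "store": 32}


def growth_percent(old, new):
    """Relative increase from old to new, in percent."""
    return (new - old) * 100 / old


def per_thread(entries, threads=2):
    # ASSUMPTION: static partitioning splits the entries evenly between threads.
    return entries // threads


for kind in ("load", "store"):
    print(kind,
          growth_percent(CORE2[kind], NEHALEM[kind]),
          per_thread(NEHALEM[kind]))
```

Under that even-split assumption, a single thread running with both threads active would see fewer load entries (24) than a Core 2 core offered (32), which is one way to read the motivation for enlarging the buffers.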
From the load and store buffers, memory operations proceed to access the cache hierarchy, which has been totally redone from top to bottom. As with the P4, both the caches and TLBs are dynamically shared between threads based upon observed behavior. Nehalem’s L1D cache has retained the same size and associativity as the previous generation, but the latency increased from 3 to 4 cycles to accommodate timing constraints. As previously mentioned, each L1D cache can support more outstanding misses (up to 10) to take advantage of the extra memory bandwidth.
The remainder of the cache hierarchy in Nehalem is a substantial departure from the design found in Core 2. The last level cache in Core 2 was the L2, which was shared between two cores to reduce coherency traffic and weighed in at 6MB with 24 way associativity and a very low load to use latency of 14-15 cycles. Nehalem’s cache hierarchy has been extended to three levels, with the first two levels staying relatively small and private to each core, while the L3 cache is much larger and shared between all cores.
Each core in Nehalem has a private unified 256KB L2 cache that is 8 way associative and provides extremely fast access to data and instructions. The load to use latency was not precisely disclosed, but Intel architect Ronak Singhal indicated that it was less than 12 cycles. The L2 cache is neither inclusive nor exclusive with respect to the L1D cache and can sustain 16 misses in-flight. Like the Core 2, Nehalem can transfer data between the private caches of two cores, although not at full transfer rates.
Nehalem is the first mainstream Intel processor to pack in a giant shared L3 cache. Plenty of Intel server designs have featured L3 caches: every Itanium and many Xeon MPs since 2003 (in fact, Potomac in 2004 featured an identically sized L3 cache). Neither AMD nor Intel ever used a three level design for mainstream products until the advent of monolithic quad-core designs, starting with Barcelona last year.
Nehalem’s 8MB and 16 way associative L3 cache is inclusive of all lower levels of the cache hierarchy and shared between all four cores. Although Intel has not discussed the physical design of Nehalem at all, it appears that the L3 cache sits on a separate power plane from the cores and operates at an independent frequency. This makes sense from both a power saving and a reliability perspective, since large caches are more susceptible to soft errors at low voltage. As a result, the L3 load to use latency for Nehalem varies depending on the relative frequency and phase alignment of the cores and the L3 itself, and on the latency of arbitration for access to the L3. In the best case, i.e. phase aligned operation and frequencies that differ by an integer multiple, Nehalem’s L3 load to use latency is somewhere in the range of 30-40 cycles according to Intel architects.
The advantage of an inclusive cache is that it can handle almost all coherency traffic without disturbing the private caches of each individual core. If a cache access misses in the L3, the line cannot be present in any of the L2 or L1 caches of the cores. On the other hand, Nehalem’s L3 also acts like a snoop filter for cache hits. Each cache line in the L3 contains four “core valid” bits denoting which cores may have a copy of that line in their private caches. If a “core valid” bit is set to 0, then that core cannot possibly have a copy of the cache line, while a “core valid” bit set to 1 indicates it is possible (but not guaranteed) that the core in question could have a private copy of the line. Since Nehalem uses the MESIF cache coherency protocol, as discussed previously, if two cores have valid bits set, then the cache line is guaranteed to be clean (i.e. not modified). The combination of these two techniques lets the L3 cache insulate each of the cores from as much coherency traffic as possible, leaving more bandwidth available for actual data in the caches.
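The filtering behavior described above can be sketched as a toy model. This is not Intel's implementation: the class and function names are hypothetical, the L3 is reduced to a dictionary of lines, and excluding the requesting core from the snoop list is an assumption made for illustration.

```python
NUM_CORES = 4


class L3Line:
    """One L3 cache line together with its per-core 'core valid' bits."""

    def __init__(self):
        self.core_valid = [False] * NUM_CORES


def cores_to_snoop(l3, address, requester):
    """Return the cores whose private caches may need probing for `address`.

    `l3` is a dict mapping line addresses to L3Line objects (a stand-in
    for the real tag array).
    """
    line = l3.get(address)
    if line is None:
        # L3 miss: inclusion means no private cache can hold the line.
        return []
    # L3 hit: only cores whose valid bit is set might hold a copy.
    # ASSUMPTION: the requesting core itself does not need to be snooped.
    return [c for c in range(NUM_CORES)
            if line.core_valid[c] and c != requester]


def line_is_known_clean(line):
    # Per the text: if two (or more) cores have valid bits set,
    # the line cannot be in a modified state.
    return sum(line.core_valid) >= 2


l3 = {0x1000: L3Line()}
l3[0x1000].core_valid[2] = True
print(cores_to_snoop(l3, 0x1000, requester=0))  # only core 2 may hold the line
print(cores_to_snoop(l3, 0x2000, requester=0))  # L3 miss, nobody to snoop
```

In this model a set bit only means "possibly present", so a probe sent to a listed core can still miss; a cleared bit is what lets the L3 skip that core entirely.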
To be fair, inclusive caches are not the only way of doing this, and they also have some overhead. A non-inclusive cache can achieve the same benefits by simply replicating the tag files for all private caches alongside the last level cache and simultaneously checking both the last level tags and the private cache tags on any access. Inclusive caches are forced by design to replicate data, which implies certain relationships between the sizes of the various levels of the cache. In the case of Nehalem, each core contains 64KB of data in the L1 caches and 256KB in the L2 cache (there may or may not be data that is in both the L1 and L2 caches). Across four cores, this means that 1-1.25MB of the 8MB L3 cache in Nehalem is filled with data that is also in other caches. Consequently, inclusive caches should only really be used where there is a fairly substantial size difference between the two levels. Nehalem has about an 8X difference between the sum of the four L2 caches and the L3, while Barcelona’s L3 cache is the same size as the total of the L2 caches.
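The 1-1.25MB figure follows directly from the cache sizes quoted above. A minimal sketch of the arithmetic, using only those sizes and hypothetical names:

```python
KB = 1024
MB = 1024 * KB

CORES = 4
L1_PER_CORE = 64 * KB    # L1 capacity per core, as quoted in the text
L2_PER_CORE = 256 * KB   # private L2 per core
L3_SIZE = 8 * MB         # shared, inclusive L3


def replicated_bounds():
    """Lower and upper bound on L3 capacity duplicating private-cache data."""
    l1_total = CORES * L1_PER_CORE
    l2_total = CORES * L2_PER_CORE
    low = l2_total              # every L1 line is also held in the L2
    high = l1_total + l2_total  # L1 and L2 contents do not overlap at all
    return low, high


low, high = replicated_bounds()
print(low / MB, high / MB)                    # bounds in MB
print(low / L3_SIZE, high / L3_SIZE)          # fraction of the L3 duplicated
print(L3_SIZE / (CORES * L2_PER_CORE))        # L3 size relative to all L2s
```

The two bounds correspond to the two extremes allowed by an L2 that is neither inclusive nor exclusive of the L1: complete overlap gives 1MB, no overlap gives 1.25MB, or roughly 12.5% to 15.6% of the L3.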
Nehalem’s cache hierarchy has also been made more flexible by increasing support for unaligned accesses. Previous generations of Intel chips always had two instructions for 16B SSE loads and stores: one for data that is guaranteed to be aligned to 16B boundaries and one for data that may be unaligned. The latter was avoided by compilers because it ran slower with less throughput (even if the data was actually aligned). Nehalem changes the situation by giving aligned accesses and unaligned accesses that touch aligned data the same latency and throughput, and also by improving the performance of unaligned accesses that touch unaligned data. As a result, an unaligned SSE load or store that touches aligned data will always have the same latency as an aligned memory access, so there is no particular reason to use the aligned SSE memory instructions.
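The distinction drawn above is between the instruction used and the actual alignment of the address it touches. The following sketch classifies a 16B access by its address. The 64B cache line size is not stated in this section and is an assumption here, as is the idea that an access spanning two cache lines is the more expensive unaligned case; the helper names are hypothetical.

```python
SSE_WIDTH = 16    # bytes moved by one SSE load or store
CACHE_LINE = 64   # bytes; ASSUMPTION: line size is not given in the text


def is_aligned(address, width=SSE_WIDTH):
    """True if the access starts on a `width`-byte boundary."""
    return address % width == 0


def crosses_cache_line(address, width=SSE_WIDTH, line=CACHE_LINE):
    """True if the first and last byte of the access fall in different lines."""
    first_line = address // line
    last_line = (address + width - 1) // line
    return first_line != last_line


for addr in (0x1000, 0x1008, 0x1038):
    print(hex(addr), is_aligned(addr), crosses_cache_line(addr))
```

Of the three sample addresses, the first is 16B aligned, the second is unaligned but stays within one cache line, and the third is unaligned and straddles two lines.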