The Memory System
The memory pipelines and caches in Barcelona have been substantially reworked; they now have some limited out-of-order capabilities, and each pipe can perform a 128 bit load or a 64 bit store every cycle. Memory operations in both the K8 and Barcelona start in the integer schedulers, and are dispatched to both the AGU and the 12 entry LSU1. The address generation takes one cycle, and the result is forwarded to the LSU1, where the data access waits.
Figure 5 – Comparison of Memory Pipelines
At this point, the behavior of Barcelona and the K8 diverges. In the K8, memory accesses were issued in-order, so if a load could not issue, it also stalled every subsequent load or store operation. Barcelona offers non-speculative memory access re-ordering. What this really means is that some memory operations can issue out-of-order.
During the issue phase, the lower 12 bits of the load operation’s address are tested against prior store addresses; if they are different, then the load may proceed ahead of the store, and if they are the same, there may be an opportunity for load-store forwarding. This is equivalent to the memory re-ordering capabilities of the P6 – a load may move ahead of another load, and a load may move ahead of a store if and only if they are accessing different addresses. Unlike the Core 2, there are no prediction and recovery mechanisms and no loads may pass a store with an unknown address.
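The issue-time check described above can be sketched in Python. This is a simplified model, not AMD's actual logic: the function name `classify_load`, the use of `None` for a store whose address has not been generated yet, and the three result strings are all illustrative assumptions, and access sizes are ignored.

```python
LOW_BITS_MASK = 0xFFF  # lower 12 bits of the address, as compared at issue


def classify_load(load_addr, prior_store_addrs):
    """Decide what a load may do relative to older, still-pending stores.

    prior_store_addrs holds one entry per older store: an int address,
    or None when the address is not yet known (assumed encoding).
    """
    # No load may pass a store whose address is unknown.
    if any(addr is None for addr in prior_store_addrs):
        return "wait"
    load_low = load_addr & LOW_BITS_MASK
    # Matching low bits: the load cannot pass, but forwarding may be possible.
    if any((addr & LOW_BITS_MASK) == load_low for addr in prior_store_addrs):
        return "check_forwarding"
    # Low bits differ from every older store: the load may go ahead.
    return "may_pass"


print(classify_load(0x1234, [0x5678]))        # may_pass
print(classify_load(0x1234, [0x9234]))        # check_forwarding
print(classify_load(0x1234, [None, 0x5678]))  # wait
```

Because only 12 bits are compared in this model, two different addresses such as 0x1234 and 0x9234 are treated as a match. That errs on the conservative side: the load is held back rather than allowed to pass incorrectly.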
In the 12 entry LSU1, the oldest operations translate their addresses from the virtual address space to the physical address space using the L1 DTLB. The L1 DTLB now includes 8 entries for 1GB pages, which is useful for databases and HPC applications with large working sets. Any miss in the L1 DTLB will check the L2 DTLB. Once the physical address has been found, two micro-ops can probe (in case of a store) or read from (in case of a load) the cache each cycle, in any combination of load and store. The ability to do two 128 bit loads a cycle is beneficial primarily for HPC, where the bandwidth from the second port can come in handy. Once the load or store has probed the cache, it will move on to LSU2.
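To see why the 1GB page entries matter for large working sets, it helps to compute the TLB reach, meaning the amount of memory that can be translated without a miss. The short Python sketch below does only that arithmetic; the helper name `tlb_reach_bytes` is hypothetical, and the 4KB comparison is a what-if for the same 8 entries, not a description of Barcelona's 4KB entry count.

```python
KIB = 1 << 10
GIB = 1 << 30


def tlb_reach_bytes(entries, page_bytes):
    """Memory covered by a set of TLB entries: entries times page size."""
    return entries * page_bytes


# The 8 entries for 1GB pages mentioned in the text.
reach_1g = tlb_reach_bytes(8, GIB)
print(reach_1g // GIB)  # 8 (GiB)

# Hypothetical comparison: the same 8 entries mapping 4KB pages instead.
reach_4k = tlb_reach_bytes(8, 4 * KIB)
print(reach_4k // KIB)  # 32 (KiB)
```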
LSU2 holds up to 32 memory accesses, where they stay until retirement. LSU2 handles most of the complexity in the memory pipeline. It resolves any cache or TLB misses by scheduling and probing the necessary structures. In the case of a cache miss, it will escalate up to the L2, L3 or memory, and TLB misses go to the L2 TLB, or to main memory, where the page tables reside. The LSU2 also holds store instructions, which are not allowed to actually modify the caches until retirement to ensure correctness. Since all the stores are held in LSU2, it also does the load-store forwarding. Note that stores are still 64 bits wide, hence two entries are used to track a full 128 bit SSE write. This is a slight disadvantage, as some instruction sequences, particularly those that involve copying data in memory, have equal numbers of reads and writes. However, the general trend is that there are twice as many (or more) loads as stores in an application.
The 64KB L1D cache is 2 way associative, with 64 byte lines and a 3 cycle access time. It uses a write-back policy to the L2 cache, which is exclusive of the L1. The data paths into and out of the L1D cache were also widened to 256 bits (128 bits transmit and 128 bits receive), so a 64 byte line is transmitted in 4 cycles. As in the K8, the L2 cache is private to each core. The L2 capacity has been halved to 512KB, but the line size and associativity were kept at 64B and 16 ways respectively.
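The figures above can be checked with a little arithmetic. The Python sketch below derives the number of sets in each cache and the line transfer time from the stated capacities, associativities, line size and path width. The helper names are hypothetical, and the set counts are derived here rather than quoted from AMD.

```python
KB = 1024


def num_sets(capacity_bytes, ways, line_bytes):
    """Sets = capacity / (associativity * line size)."""
    return capacity_bytes // (ways * line_bytes)


def line_transfer_cycles(line_bytes, bits_per_cycle):
    """Cycles to move one line over a path of the given width (rounded up)."""
    return -(-(line_bytes * 8) // bits_per_cycle)  # ceiling division


l1d_sets = num_sets(64 * KB, 2, 64)      # 64KB, 2 way, 64B lines
l2_sets = num_sets(512 * KB, 16, 64)     # 512KB, 16 way, 64B lines
cycles = line_transfer_cycles(64, 128)   # 64B line over 128 bits per direction

print(l1d_sets, l2_sets, cycles)  # 512 512 4
```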
The L3 cache in Barcelona is an entirely new feature for AMD. The shared 2MB L3 cache is 32 way associative and uses 64B lines, but did not fit in Figure 5. The cache controller is flexible, and various AMD documents indicate that it can support up to 8MB of L3 cache. The L3 cache is specifically designed with data sharing in mind. This entails three particular changes from AMD’s traditional cache hierarchy. First, it is mostly exclusive, but not entirely so. When a line is sent from the L3 cache to an L1D cache, if the cache line is shared, or is likely to be shared, then it will remain in the L3, leading to duplication which would never happen in a totally exclusive hierarchy. A fetched cache line is likely to be shared if it contains code, or if the data has been previously shared (sharing history is tracked). Second, the eviction policy for the L3 has been changed. In the K8, when a cache line was brought in from memory, a pseudo-least recently used algorithm would evict the oldest line in the cache. However, in Barcelona’s L3, the replacement algorithm has been changed to also take into account sharing, and it prefers evicting unshared lines. Lastly, since the L3 is shared between four different cores, access to the L3 must be arbitrated. A round-robin algorithm is used to give access to one of the four cores each cycle. The latency to the L3 cache has not been disclosed, but it depends on the relative northbridge and core frequencies, for reasons which we will see later.
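The last two changes, sharing-aware replacement and round-robin arbitration, can be illustrated with a toy Python model. AMD has not published the exact algorithms, so everything here is an assumption made for illustration: the `L3Line` record, the use of an integer age in place of pseudo-LRU state, the fallback to the oldest line when every line is shared, and the choice to skip cores that are not requesting.

```python
from collections import namedtuple

# Hypothetical record for one line in an L3 set; larger age = less recently used.
L3Line = namedtuple("L3Line", ["tag", "age", "shared"])


def pick_victim(lines):
    """Prefer evicting an unshared line; among candidates, evict the oldest."""
    unshared = [line for line in lines if not line.shared]
    candidates = unshared if unshared else lines  # assumed fallback
    return max(candidates, key=lambda line: line.age)


def round_robin_grant(requests, last_granted):
    """Grant the next requesting core after last_granted, wrapping around.

    requests is a list of booleans, one per core. Returns the granted core
    index, or None when no core is requesting.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        core = (last_granted + offset) % n
        if requests[core]:
            return core
    return None


lines = [L3Line("a", 9, True), L3Line("b", 3, False), L3Line("c", 5, False)]
print(pick_victim(lines).tag)                            # c
print(round_robin_grant([True, False, True, True], 0))   # 2
```

In this model the shared line "a" survives even though it is the oldest, which matches the stated preference for evicting unshared lines.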
The last improvements to Barcelona in the memory pipeline are the prefetchers. Each core has 8 data prefetchers (a total of 32 per device), which now fill to the L1D cache in Barcelona. In the K8, prefetched results were held in the L2 cache. The instruction prefetcher for Barcelona can have up to 2 outstanding fetches to any address, whereas the K8 was restricted to one fetch to an odd address and one fetch to an even address.