Since Sandy Bridge can potentially double the FP performance compared to Nehalem, the memory pipeline must substantially improve to keep pace and feed the execution units. This necessitated a redesign of the memory pipeline, particularly the elements closest to the core. When the Sandy Bridge architects were evaluating how to improve the memory pipeline, one of the key goals was to ensure that the changes would benefit all software, not just AVX. Even though the performance for AVX should be good enough to encourage adoption, it will still take time to penetrate the software world. Given these constraints, and the focus on power efficiency, the design point is fairly straightforward.
To start with, the number of memory operations in-flight increased dramatically, in tandem with the overall instruction window. The load buffer grew by 33% and can track 64 uops in-flight. Sandy Bridge’s store buffer increased slightly to 36 stores, for an overall 100 simultaneous memory operations, roughly two thirds of the total uops in-flight. To put this in perspective, the number of memory uops in-flight for Sandy Bridge is greater than the entire instruction window for the Core 2 Duo. As in Nehalem, the load and store buffers are partitioned between threads.
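The buffer arithmetic above can be checked with a short sketch. The Sandy Bridge load and store buffer sizes come from the text; the Nehalem load buffer size, the Sandy Bridge window size and the Core 2 Duo window size are assumptions that this section does not state, and they are marked as such in the code.

```python
# Buffer sizes quoted in the text for Sandy Bridge.
SNB_LOAD_BUFFER = 64
SNB_STORE_BUFFER = 36

# Assumed figures that are NOT stated in this section.
NEHALEM_LOAD_BUFFER = 48   # assumption: consistent with 33% growth to 64
SNB_WINDOW_ENTRIES = 168   # assumption: total uops in flight on Sandy Bridge
CORE2_WINDOW_ENTRIES = 96  # assumption: Core 2 Duo instruction window


def growth_percent(old: int, new: int) -> float:
    """Relative growth from old to new, in percent."""
    return 100.0 * (new - old) / old


memory_ops_in_flight = SNB_LOAD_BUFFER + SNB_STORE_BUFFER
load_buffer_growth = growth_percent(NEHALEM_LOAD_BUFFER, SNB_LOAD_BUFFER)
share_of_window = memory_ops_in_flight / SNB_WINDOW_ENTRIES

print(f"memory ops in flight:  {memory_ops_in_flight}")
print(f"load buffer growth:    {load_buffer_growth:.1f}%")
print(f"share of window:       {share_of_window:.2f}")
print(f"exceeds Core 2 window: {memory_ops_in_flight > CORE2_WINDOW_ENTRIES}")
```

Under these assumptions the memory buffers cover about 0.6 of the window, slightly under the "two thirds" quoted above.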
Figure 6 – Sandy Bridge Memory Subsystem and Comparison
In Nehalem, there are two address generation units. One is dedicated to load uops (on port 2), while the other can only be used for store uops (on port 3). As another example of re-using existing resources in a more efficient fashion, Sandy Bridge retains the two address generation units. However, they are now fully flexible and each AGU can be used for loads or stores, effectively doubling the load bandwidth. This is a huge benefit, since many workloads can take advantage of two loads per cycle, whereas a 1:1 ratio of loads to stores is relatively unusual.
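To see why flexible AGUs matter for load-heavy code, here is a minimal sketch that counts only address-generation cycles and ignores every other pipeline limit. The function names are hypothetical and the model is a lower bound, not a simulation of either core.

```python
from math import ceil


def cycles_dedicated(loads: int, stores: int) -> int:
    """Nehalem-style: one load-only AGU and one store-only AGU.

    Each AGU handles one uop per cycle, so the busier unit sets the pace.
    """
    return max(loads, stores)


def cycles_flexible(loads: int, stores: int) -> int:
    """Sandy Bridge-style: two AGUs, each able to take a load or a store."""
    return ceil((loads + stores) / 2)


# A 2:1 mix of loads to stores, then a balanced 1:1 mix.
for loads, stores in [(200, 100), (100, 100)]:
    print(loads, stores,
          cycles_dedicated(loads, stores),
          cycles_flexible(loads, stores))
```

In this model the 2:1 mix drops from 200 to 150 address-generation cycles, while the balanced 1:1 mix sees no change, which matches the observation that the gain comes from load-heavy workloads.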
After address generation, uops will access the DTLB to translate from a virtual to a physical address, in parallel with the start of the cache access. The DTLB was mostly kept the same, but the support for 1GB pages has improved. Previously, Westmere added support for 1GB pages, but fragmented 1GB pages into many 2MB pages since the TLB did not have any 1GB page entries. Sandy Bridge adds 4 dedicated entries for 1GB pages in the DTLB.
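The effect of dedicated 1GB entries is easy to quantify. The sketch below assumes nothing beyond the page sizes and the entry count given above; the helper name `split_virtual_address` is hypothetical and only illustrates how the page size determines how many address bits need translating.

```python
MB = 1 << 20
GB = 1 << 30

DTLB_1GB_ENTRIES = 4  # from the text

# When a 1GB page is fragmented into 2MB pages, this many 2MB
# translations are needed to cover it.
fragments_per_1gb_page = GB // (2 * MB)

# Address range covered by the dedicated 1GB entries alone.
reach_1gb_entries = DTLB_1GB_ENTRIES * GB


def split_virtual_address(address: int, page_bytes: int) -> tuple:
    """Split an address into (virtual page number, offset within the page)."""
    return address // page_bytes, address % page_bytes


print(fragments_per_1gb_page)
print(reach_1gb_entries // GB, "GB")
print(split_virtual_address(0x4000_1234, GB))
```

A single 1GB mapping therefore replaces 512 separate 2MB translations, and four entries cover 4GB of address space.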
Sandy Bridge’s L1D cache was redesigned to maintain the same low latency for existing workloads, while increasing bandwidth to match the compute performance gains when using AVX. Sandy Bridge’s 32KB L1D is 8-way associative with 64B lines, so each way spans 4KB (the size of a base page) and the cache can be virtually indexed and physically tagged. It uses a write-back policy. The load-to-use latency is steady at 4 cycles for integer uops (due to bypass, the latency for FP and SIMD is 1 or 2 cycles more). The L1D cache can sustain two 128-bit loads and a 128-bit store every cycle, a 50% increase in bandwidth. Accordingly, the L1D cache has multiple 8 byte banks for simultaneous accesses. When a bank conflict occurs, one of the two memory operations will be delayed a cycle and a state machine will record the conflict and adjust scheduling in the future to prevent subsequent bank conflicts. Both load ports are equipped to efficiently handle misaligned memory accesses, store forwarding and memory disambiguation. As with Nehalem, there can be 10 outstanding misses from the L1D cache to higher levels of the memory hierarchy.
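The cache geometry can be worked out from the figures above. The sketch below assumes a 4KB base page, which is not stated in the text, and shows why a 32KB, 8-way cache with 64B lines can be indexed with virtual address bits alone; the function `set_index` is a hypothetical helper.

```python
CACHE_BYTES = 32 * 1024
WAYS = 8
LINE_BYTES = 64
PAGE_BYTES = 4096  # assumption: 4KB base page

sets = CACHE_BYTES // (WAYS * LINE_BYTES)
offset_bits = LINE_BYTES.bit_length() - 1       # bits selecting a byte in a line
index_bits = sets.bit_length() - 1              # bits selecting a set
page_offset_bits = PAGE_BYTES.bit_length() - 1  # bits untouched by translation

# If the set index lies entirely inside the page offset, the virtual and
# physical index are identical and the lookup can start before the TLB
# has finished translating.
index_fits_in_page_offset = offset_bits + index_bits <= page_offset_bits


def set_index(address: int) -> int:
    """Set selected by an address; only uses bits below the page offset."""
    return (address // LINE_BYTES) % sets


# Peak data bandwidth: two 128-bit loads plus one 128-bit store per cycle.
peak_bytes_per_cycle = (2 + 1) * 16

print(sets, offset_bits, index_bits, index_fits_in_page_offset)
print(peak_bytes_per_cycle, "bytes/cycle")
```

With 64 sets, the index and line offset together use exactly 12 bits, so two addresses that differ only above bit 11 always map to the same set.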
Like Bulldozer, Sandy Bridge cannot sustain the full bandwidth of the cache with 128-bit accesses because there are three data accesses per cycle to the L1D, but only two issue ports for addresses and AGUs. Servicing three data requests per cycle can help to clear out some queued operations, but the main benefit is when executing 256-bit AVX code, where the extra bandwidth is essential to sustaining a 2:1 ratio of loads and stores.
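A rough throughput model makes the argument concrete. It assumes a steady 2:1 stream of loads and stores, two address issues per cycle, and three 128-bit data accesses per cycle, and it treats the three data paths as interchangeable, which is a simplification. The function name is hypothetical.

```python
AGUS = 2                     # address issues per cycle
DATA_ACCESSES_PER_CYCLE = 3  # 128-bit accesses the L1D can service per cycle
DATA_PATH_BYTES = 16         # each data access is 128 bits wide


def sustained_bytes_per_cycle(access_bits: int,
                              loads: int = 2,
                              stores: int = 1) -> float:
    """Bytes per cycle for a repeating group of loads and stores.

    The group is limited by whichever is slower: generating the
    addresses or moving the data through the cache.
    """
    uops = loads + stores
    bytes_per_uop = access_bits // 8
    agu_cycles = uops / AGUS
    data_accesses = uops * (bytes_per_uop // DATA_PATH_BYTES)
    cache_cycles = data_accesses / DATA_ACCESSES_PER_CYCLE
    return uops * bytes_per_uop / max(agu_cycles, cache_cycles)


print(sustained_bytes_per_cycle(128))
print(sustained_bytes_per_cycle(256))
```

In this model 128-bit code is bound by the two AGUs and reaches 32 bytes per cycle, while 256-bit code is bound by the data paths and reaches the full 48 bytes per cycle.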
According to Intel, 256-bit memory accesses execute as a single uop, thus they only use one entry in the ROB, PRF and memory order buffers. However, the actual data accesses are 128 bits wide, so a clever approach is needed to have a 256-bit load execute as a single uop. The most likely technique is described in an Intel patent application and suggests using the 1 cycle bypass delay to guarantee that both halves of a 256-bit load complete in the same cycle. The high half would be initiated in the first cycle, doing address calculation, a TLB lookup and checking the cache tags for a hit, and the 128-bit high half would reach the FP execution stack 5 cycles later. The low half would subtract 16B from the address of the high half and start in the second cycle, with the result reaching the SIMD execution stack 4 cycles later. Thus both 128-bit accesses would arrive in the same cycle and complete together as a single 256-bit uop. A variation of this might also work for stores, but the general approach only works for aligned accesses. Crossing a cache line or page would complicate the situation by requiring a separate tag check for a hit or a separate TLB lookup.
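The timing of the two halves can be laid out in a small sketch. It follows the cycle counts described above and is only a model of the speculated technique, not a confirmed description of the hardware. The class and function names are hypothetical, and the 64B line size is taken from the L1D description.

```python
from dataclasses import dataclass
from typing import Tuple

LINE_BYTES = 64  # L1D line size


@dataclass
class HalfAccess:
    name: str
    address: int
    start_cycle: int
    latency: int

    @property
    def arrival_cycle(self) -> int:
        return self.start_cycle + self.latency


def split_256bit_load(low_address: int) -> Tuple[HalfAccess, HalfAccess]:
    """Model a 256-bit load as two staggered 128-bit accesses."""
    # High half goes first and pays the extra bypass cycle.
    high = HalfAccess("high", low_address + 16, start_cycle=1, latency=5)
    # Low half starts a cycle later, 16B below the high half.
    low = HalfAccess("low", high.address - 16, start_cycle=2, latency=4)
    return high, low


def stays_in_one_line(low_address: int) -> bool:
    """True if all 32 bytes fall within a single cache line."""
    return low_address // LINE_BYTES == (low_address + 31) // LINE_BYTES


high, low = split_256bit_load(0x1000)
print(high.arrival_cycle, low.arrival_cycle)
print(stays_in_one_line(0x1000), stays_in_one_line(0x1030))
```

Both halves arrive in the same cycle in this model, and the second check shows the case where the high half falls into the next line and would need its own tag check.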
AVX does not have scatter/gather like a GPU or Larrabee, because the area, power and performance impact would be quite substantial and compete with the key goal of low power and low latency for the L1D cache. However, AVX has conditional masking for each lane of vector loads and stores. Suppressed lanes in a memory operation will not raise any faults (e.g. page faults or protection violations). This is much less overhead than scatter/gather, since much of the logic is already present for handling misaligned data or store forwarding. AVX also includes instructions that broadcast a value to all lanes of a vector.
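The semantics of masked loads and broadcasts can be illustrated with a small software model. This is a sketch of the behaviour described above, not of the hardware: memory is modelled as a dictionary of mapped addresses, `PageFault` stands in for any fault, and suppressed lanes are filled with zero. All names are hypothetical.

```python
from typing import Dict, List, Sequence


class PageFault(Exception):
    """Raised when the model touches an unmapped address."""


def load_element(memory: Dict[int, float], address: int) -> float:
    if address not in memory:
        raise PageFault(hex(address))
    return memory[address]


def masked_load(memory: Dict[int, float], base: int,
                mask: Sequence[int], element_bytes: int = 4) -> List[float]:
    """Load one element per lane, skipping lanes whose mask bit is clear.

    Suppressed lanes never access memory, so they cannot fault.
    """
    result = []
    for lane, enabled in enumerate(mask):
        if enabled:
            result.append(load_element(memory, base + lane * element_bytes))
        else:
            result.append(0.0)
    return result


def broadcast(memory: Dict[int, float], address: int,
              lanes: int = 8) -> List[float]:
    """Load a single element and replicate it across every lane."""
    return [load_element(memory, address)] * lanes


# Only the first two elements are mapped; the masked lanes lie beyond them.
memory = {0: 1.0, 4: 2.0}
print(masked_load(memory, 0, [1, 1, 0, 0]))
print(broadcast(memory, 4, lanes=4))
```

Enabling a lane that points at unmapped memory raises `PageFault` in this model, while the same address behind a cleared mask bit is never touched.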
The unified, per-core L2 cache in Sandy Bridge is mostly carried forward from Nehalem. It is a 256KB, 8-way associative design with a 12-cycle load-to-use latency (versus 10 cycles in Nehalem). The L2 has a non-exclusive/non-inclusive relationship to the L1 caches and is a write-back design. The bandwidth and outstanding misses for the L2 cache are the same as for Nehalem: 32B/cycle, with 16 outstanding misses.