Haswell Memory Hierarchy
The most significant and comprehensive changes in Haswell are all in the memory hierarchy. At a high level, Haswell has twice the FLOP/s of Sandy Bridge. But raw compute power alone is rarely an interesting proposition; to take advantage of the new capabilities, the cache bandwidth for Haswell has also doubled. Moreover, the whole memory hierarchy must be adapted for gather instructions and transactional memory. As with Sandy Bridge, the caches, TLBs and fill buffers are competitively shared by any active threads; the load and store buffers, however, are statically partitioned.
Memory accesses start by allocating entries in the load and store buffers, which can track well over 100 uops in Haswell. In Sandy Bridge, ports 2-4 were responsible for memory accesses: ports 2 and 3 contained the address generation units (AGUs) for loads and stores, while port 4 wrote store data from the core to the L1 data cache. Address generation took a single cycle when the (base + offset) was below 2K, with an extra cycle for larger (base + offset) or (base + index + offset) addressing.
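The addressing distinctions above can be sketched with a toy model. The function names and the exact fast-path condition are illustrative, taken only from the description above, not from any hardware documentation:

```python
def effective_address(base, disp=0, index=0, scale=1):
    """x86-style effective address: base + index*scale + disp."""
    assert scale in (1, 2, 4, 8)
    return base + index * scale + disp

def agu_latency_cycles(disp=0, indexed=False):
    """Toy model of the Sandy Bridge AGU fast path: simple
    (base + offset) addressing with a small offset takes one cycle;
    a larger offset or (base + index + offset) addressing takes an
    extra cycle."""
    if not indexed and 0 <= disp < 2048:
        return 1
    return 2
```

For example, `agu_latency_cycles(disp=8)` falls on the one-cycle fast path, while `agu_latency_cycles(disp=8, indexed=True)` models the slower indexed case.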
Haswell augments these resources by adding port 7 to handle address generation for stores. This is a significant performance boost for a wide range of code, since Haswell can now sustain 2 loads and 1 store per cycle under nearly any circumstances. In contrast, Sandy Bridge could only achieve this when using 256-bit memory accesses.
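The impact of the extra store AGU can be estimated with a simple port-pressure bound. For a streaming kernel such as c[i] = a[i] + b[i], each iteration issues two loads and one store. The model below is a deliberate simplification that ignores everything except address-generation ports, but it shows why Sandy Bridge is limited to 1.5 cycles per iteration while Haswell can reach 1.0:

```python
def min_cycles_per_iter(loads, stores, general_agus, store_agus=0):
    """Port-limited lower bound on cycles per loop iteration.
    Loads may only use the general AGUs; stores may use either a
    general AGU or a dedicated store AGU, giving two constraints."""
    load_bound = loads / general_agus
    total_bound = (loads + stores) / (general_agus + store_agus)
    return max(load_bound, total_bound)

# c[i] = a[i] + b[i]: two loads and one store per element
sandy_bridge = min_cycles_per_iter(2, 1, general_agus=2)           # -> 1.5
haswell = min_cycles_per_iter(2, 1, general_agus=2, store_agus=1)  # -> 1.0
```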
A dedicated store AGU is slightly less expensive than a more general AGU. Store uops only need to write the address (and eventually data) into the store buffer. In contrast, load uops must write into the load buffer and also probe the store buffer to check for any forwarding or conflicts.
Once an address has been calculated, the uop will probe the L1 DTLB. The L1 DTLB in Haswell is the same capacity as in Sandy Bridge: 64, 32 and 4 entries respectively for 4KB, 2MB and 1GB pages, with all the translation arrays still 4-way associative. However, the third AGU requires an additional port on the DTLB. Misses in the L1 DTLB are serviced by the unified L2 TLB, which has been substantially improved. Haswell’s L2 TLB can hold translations for 4KB and 2MB pages, and has 1024 entries that are 8-way associative. In contrast, the Sandy Bridge L2 TLB was half the size and associativity (512 entries, 4-way) and only supported 4KB pages.
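The practical effect of the larger L2 TLB is coverage: how much of the address space can be translated without a page walk. A quick calculation based on the figures above:

```python
def tlb_coverage_bytes(entries, page_bytes):
    """Address range covered when every entry maps a page of one size."""
    return entries * page_bytes

KB, MB = 1024, 1024 * 1024
haswell_4k = tlb_coverage_bytes(1024, 4 * KB)      # 4MB with 4KB pages
haswell_2m = tlb_coverage_bytes(1024, 2 * MB)      # 2GB with 2MB pages
sandy_bridge_4k = tlb_coverage_bytes(512, 4 * KB)  # 2MB with 4KB pages
```

The jump from 2MB to 2GB of coverage when large pages hit in the L2 TLB is one reason 2MB page support there matters so much for server workloads.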
The data cache itself is 32KB, 8-way associative and writeback. Since the smallest page is 4KB, the TLB access and cache tag check can occur in parallel. The L1 data cache in Haswell is not only higher throughput, but also more predictable than the previous generation. The data cache can sustain two 256-bit loads and a 256-bit store every cycle, bringing the aggregate bandwidth to 96B/cycle. In contrast, Sandy Bridge could sustain two 128-bit reads and a 128-bit write, but the performance was typically worse because there were only two AGUs.
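The 96B/cycle figure follows directly from the sustained access widths; a quick sanity check:

```python
def l1d_peak_bytes_per_cycle(load_bits, loads, store_bits, stores):
    """Aggregate L1D bandwidth from sustained per-cycle load/store widths."""
    return (loads * load_bits + stores * store_bits) // 8

haswell = l1d_peak_bytes_per_cycle(256, 2, 256, 1)       # 96 B/cycle
sandy_bridge = l1d_peak_bytes_per_cycle(128, 2, 128, 1)  # 48 B/cycle peak
```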
The Sandy Bridge L1D was the same size and associativity, but it was built from 8 banks that each provide 8B of data. An unaligned 16B load can touch 3 banks, and bank conflicts force one of the load uops to retry in a subsequent cycle, reducing the available bandwidth. According to Intel, the L1D in Haswell does not suffer from bank conflicts, which suggests a more aggressive physical implementation. That is especially impressive given that the minimum latency is still 4 cycles, with an extra cycle for SIMD or FP loads.
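A toy model of the banked Sandy Bridge L1D shows why unaligned loads are problematic. The bank mapping below is the natural one implied by 8 banks of 8B, not a verified description of the actual hardware:

```python
def banks_touched(addr, size, bank_bytes=8, num_banks=8):
    """Set of banks a load of `size` bytes at byte address `addr`
    touches, assuming consecutive 8B chunks map to consecutive banks."""
    first = addr // bank_bytes
    last = (addr + size - 1) // bank_bytes
    return {chunk % num_banks for chunk in range(first, last + 1)}

# An aligned 16B load touches 2 banks; misalign it by 4 bytes and it
# straddles 3 banks, raising the odds of a conflict with another load.
aligned = banks_touched(0, 16)    # {0, 1}
unaligned = banks_touched(4, 16)  # {0, 1, 2}
```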
One advantage of completing 32B loads in a single cycle on Haswell is that store forwarding for AVX loads is much more robust. Haswell also has 4 split line buffers to resolve unaligned loads, compared to 2 in Sandy Bridge, along with other enhancements that decrease the impact of unaligned accesses. The forwarding latency for 256-bit AVX loads has also decreased from 2 cycles in Sandy Bridge to a single cycle in Haswell.
Haswell and Sandy Bridge track cache misses using 10 line fill buffers for cache lines that are being provided from other caches or memory. The outstanding requests are sent to the unified L2 cache. The L2 cache is a 256KB, 8-way associative and writeback design with ECC protection. The L2 is neither inclusive nor exclusive of the L1 data cache. The L2 bandwidth has doubled for Haswell so that it can provide a full 64B line to the data or instruction cache every cycle, while maintaining the same 11 cycle minimum latency and 16 outstanding misses.
As we speculated, Haswell’s transactional memory retains the write-set and read-set of a transaction in the L1 data cache, rather than using the store buffer. The L2 cache is non-transactional, so if a write-set cache line is evicted, the transaction will abort. Intriguingly, it appears that in some circumstances read-set lines can be safely evicted from the L1 and are tracked using another hardware mechanism. One possibility is a small on-chip transactional victim buffer, or some sort of storage in memory. In the case of an abort, all the write-set lines are flushed from the L1D, while a commit will make all write-set lines atomically visible. On top of the throughput benefits of TSX, the minimum lock latency on Haswell is down to roughly 12 cycles, little more than two dependent L1D hits, compared to 16 cycles on Sandy Bridge.
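The capacity constraint on the write-set can be illustrated with a toy model. This is only a sketch of the abort rule described above: a simple line-count capacity stands in for the real 8-way set-associative L1D, and read-set tracking is omitted since its overflow mechanism is unclear:

```python
class ToyTransaction:
    """Toy TSX model: buffered stores must stay resident; exceeding
    the capacity models a write-set line being evicted from the L1D."""
    def __init__(self, capacity_lines=8):
        self.capacity_lines = capacity_lines
        self.write_set = set()
        self.aborted = False

    def store(self, line_addr):
        if self.aborted:
            return False
        self.write_set.add(line_addr)
        if len(self.write_set) > self.capacity_lines:
            self.aborted = True     # eviction of a write-set line: abort
            self.write_set.clear()  # aborted stores are flushed, never visible
        return not self.aborted

    def commit(self):
        """On success all write-set lines become visible atomically."""
        return not self.aborted

t = ToyTransaction(capacity_lines=2)
t.store(0x000); t.store(0x040)
ok = t.commit()            # True: the write-set fit

t2 = ToyTransaction(capacity_lines=2)
for line in (0x000, 0x040, 0x080):
    t2.store(line)
overflowed = t2.commit()   # False: the third line overflowed the capacity
```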
As previously mentioned, gather instructions are executed as multiple load uops, one for every element of the gather. Each load uop requires a load buffer entry and will access the L1 DTLB and L1 data cache, even in the scenario where there is locality between gather elements. Each cache line that must be fetched from the L2, L3 or memory to satisfy the gather will consume a line fill buffer. Over time the implementation should become substantially more efficient, e.g. reducing the number of TLB or cache accesses for gather instructions with good locality.
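Behaviorally, a gather is equivalent to the scalar expansion below, which mirrors the one-load-uop-per-element execution just described. This is a sketch of AVX2-style gather semantics, in which masked-off lanes keep their previous destination value; the fill-buffer helper simply counts distinct 64B lines:

```python
def gather(memory, indices, dst=None, mask=None):
    """Scalar expansion of a gather: one load per element."""
    n = len(indices)
    dst = list(dst) if dst is not None else [0] * n
    mask = mask if mask is not None else [True] * n
    return [memory[i] if m else d for i, d, m in zip(indices, dst, mask)]

def lines_touched(byte_addrs, line_bytes=64):
    """Distinct cache lines touched; each one that misses occupies a
    line fill buffer while the request is outstanding."""
    return len({a // line_bytes for a in byte_addrs})

data = [10, 20, 30, 40, 50]
result = gather(data, [4, 0, 2, 2])  # [50, 10, 30, 30]
```

Note that even the repeated index 2 costs a separate load uop on Haswell, which is exactly the locality case a future implementation could optimize.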