Perhaps the most profound change in Bulldozer is in the load-store pipeline and caches. Other sections of the chip were rearchitected for efficiency or for modest performance gains. However, the load-store units were totally redesigned and improved across the board. In tandem, the inner levels of the cache hierarchy have been redone, and for the first time, AMD is fielding competitive prefetchers.
The memory pipeline for each Bulldozer core starts with the load and store queues and the integer scheduler. Any load or store in flight must be allocated an entry in the appropriate memory queue. This is necessary to maintain the relatively strong x86 memory ordering model. Previously, Istanbul had a somewhat complex two-level load-store queue, where different functions were performed in each level. Bulldozer has a conceptually simpler microarchitecture with a separate 40-entry load queue and a 24-entry store queue. In total, this means that each Bulldozer core can have 33% more memory operations in flight than the previous generation and about 20-30% fewer than Nehalem or Westmere.
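The queue arithmetic is easy to check. The short Python sketch below sums the load and store queue entries given above and back-computes the capacities that the quoted percentages would imply for the other designs. The function names are illustrative, and the implied baselines are derived only from the percentages in the text, not from vendor documentation.

```python
# Queue sizes as stated in the text.
BULLDOZER_LOAD_ENTRIES = 40
BULLDOZER_STORE_ENTRIES = 24


def in_flight_capacity(load_entries: int, store_entries: int) -> int:
    """Total memory operations that the two queues can track at once."""
    return load_entries + store_entries


def implied_baseline(capacity: int, relative_change: float) -> float:
    """Capacity of the design being compared against.

    relative_change is +0.33 for "33% more than" and -0.20 for "20% fewer than".
    """
    return capacity / (1.0 + relative_change)


bulldozer = in_flight_capacity(BULLDOZER_LOAD_ENTRIES, BULLDOZER_STORE_ENTRIES)
print(bulldozer)                                  # 64
print(round(implied_baseline(bulldozer, 0.33)))   # 48 (implied previous generation)
print(round(implied_baseline(bulldozer, -0.20)))  # 80 (implied upper comparison point)
print(round(implied_baseline(bulldozer, -0.30)))  # 91 (implied if the gap were 30%)
```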
Figure 6 – Bulldozer Memory Subsystem and Comparison
The scheduler feeds memory operations into the two AGUs responsible for address generation. While this is a decrease from the three AGUs of the prior generation, there are three reasons to suspect that the performance impact will be modest. First, the memory pipeline is more aggressive about reordering. The original K8 had a totally in-order memory pipeline, while Istanbul had a non-speculative out-of-order memory pipeline, where loads could only move ahead of stores known to have a different address. Bulldozer improves this further with a dependence predictor that determines when loads can speculatively pass stores. Intel refers to this technique as memory disambiguation, and it first showed up in the Core 2 Duo. Second, some macro-ops are of the form ‘load-op-store’ and only do address generation and translation for the load, then re-use that work for the ending store. Third, it’s possible that for 256-bit AVX instructions, the address generation is only done once, and that the second macro-op simply adds a 16B offset to the address of the first macro-op.
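To make the idea of a dependence predictor concrete, here is a toy sketch in Python. AMD has not disclosed how Bulldozer's predictor is organized, so everything below is an assumption for illustration: the class name, the table size, the indexing by load address (PC) and the saturating counter policy are all hypothetical. It shows the general pattern only: let a load pass older stores by default, and become conservative for loads that have previously been caught aliasing with a store.

```python
class DependencePredictor:
    """Toy PC-indexed predictor (hypothetical, not AMD's actual design)."""

    def __init__(self, entries: int = 256, max_count: int = 3) -> None:
        self.entries = entries
        self.max_count = max_count
        # One saturating counter per table entry; 0 means "safe to speculate".
        self.counters = [0] * entries

    def _index(self, load_pc: int) -> int:
        # Simple modulo indexing; different loads may alias to the same entry.
        return load_pc % self.entries

    def may_bypass_stores(self, load_pc: int) -> bool:
        """Predict whether this load can issue ahead of unresolved older stores."""
        return self.counters[self._index(load_pc)] == 0

    def train(self, load_pc: int, violated: bool) -> None:
        """Update after the load retires or is replayed.

        violated=True means the load turned out to depend on an older store.
        """
        i = self._index(load_pc)
        if violated:
            self.counters[i] = self.max_count  # become conservative immediately
        elif self.counters[i] > 0:
            self.counters[i] -= 1              # slowly regain confidence


predictor = DependencePredictor()
print(predictor.may_bypass_stores(0x400))  # True: no history, so speculate
predictor.train(0x400, violated=True)
print(predictor.may_bypass_stores(0x400))  # False: this load aliased before
```

A real design must also detect the violation and replay the load and its dependents, which this sketch leaves out.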
Once a memory access is ready to proceed, it probes the 32-entry, fully associative L1 DTLB for address translation. Misses in the L1 DTLB access the 1024-entry, 8-way associative L2 DTLB, and both TLBs can cache any combination of 4KB, 2MB and 1GB pages. For the L2 DTLB, this is a big improvement in both size and flexibility over Istanbul, where the L2 DTLB could hold 512 4KB pages, 128 2MB pages and 16 1GB pages.
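One way to see what the larger, unified L2 DTLB buys is to compute TLB reach, meaning the amount of memory that can be mapped without a miss (entries times page size). The Python sketch below uses only the entry counts and page sizes quoted above; the helper name `tlb_reach` is illustrative.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space covered when every entry maps a page of this size."""
    return entries * page_size


# Bulldozer L2 DTLB: 1024 entries, any mix of page sizes.
print(tlb_reach(1024, 4 * KB) // MB)  # 4    (MB, all 4KB pages)
print(tlb_reach(1024, 2 * MB) // GB)  # 2    (GB, all 2MB pages)

# Istanbul L2 DTLB: fixed entry counts per page size.
print(tlb_reach(512, 4 * KB) // MB)   # 2    (MB of 4KB pages)
print(tlb_reach(128, 2 * MB) // MB)   # 256  (MB of 2MB pages)

# Bulldozer L1 DTLB: 32 entries.
print(tlb_reach(32, 4 * KB) // KB)    # 128  (KB, all 4KB pages)
```

For a workload that uses only 4KB pages, the reach doubles from 2MB to 4MB; for one that uses only 2MB pages, it grows from 256MB to 2GB.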
The cache hierarchy has also changed radically with Bulldozer, and to fully understand the architecture it is necessary to look at both the L1D and L2 caches. Bulldozer’s 16KB L1D cache is a 4-way associative, write-through design with 64B lines and a 4 cycle load-to-use latency. It is virtually indexed and physically tagged, so that the TLB lookup can proceed in parallel with indexing the cache. Normally in a 4-way cache, an access must check 4 different locations (one per way) simultaneously to find the requested data. Bulldozer’s L1D uses way prediction to save power by predicting which of the 4 ways will contain the data and checking that way first. The downside is that a misprediction costs one or more cycles of added latency (AMD did not disclose this penalty).
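The geometry explains why virtual indexing works cleanly here. The Python sketch below derives the set count and address bit fields from the stated size, associativity and line size, then checks that the index bits fall entirely inside the 4KB page offset, which is identical in the virtual and physical address. The function names are illustrative and the code assumes power-of-two sizes.

```python
PAGE_OFFSET_BITS = 12  # 4KB pages, the smallest x86 page size


def cache_geometry(size_bytes: int, ways: int, line_bytes: int):
    """Return (sets, offset_bits, index_bits), assuming power-of-two sizes."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 of the line size
    index_bits = sets.bit_length() - 1         # log2 of the set count
    return sets, offset_bits, index_bits


def set_index(address: int, offset_bits: int, index_bits: int) -> int:
    """Extract the set index field from an address."""
    return (address >> offset_bits) & ((1 << index_bits) - 1)


sets, offset_bits, index_bits = cache_geometry(16 * 1024, ways=4, line_bytes=64)
print(sets, offset_bits, index_bits)                 # 64 6 6
print(offset_bits + index_bits <= PAGE_OFFSET_BITS)  # True: index needs no translation

# Two addresses with the same page offset map to the same set,
# regardless of which page they are in.
print(set_index(0x0FC0, offset_bits, index_bits))    # 63
print(set_index(0x7FC0, offset_bits, index_bits))    # 63
```

With 64 sets of 64B lines, each way spans exactly 4KB, so the set can be selected from untranslated address bits while the TLB supplies the physical tag for comparison.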
Bulldozer’s L1D load-to-use latency increased by one cycle over Istanbul, and is now identical to Nehalem’s. This change in timing reflects AMD’s high frequency targets and high bandwidth requirements for the L1D cache. The cache is banked for simultaneous accesses, although the arrangement was not disclosed. Given that three accesses are possible each cycle, 8 banks seems too few, 32 banks too many, and 16 banks sounds just right. The L1D is theoretically capable of two 128-bit loads and one 128-bit store per cycle (48B/cycle), although bank conflicts may reduce the available bandwidth. Moreover, the L1D cache cannot sustain three independent accesses per cycle since there are only two AGUs. However, the extra port is beneficial for clearing out queued-up operations and probably also plays a critical role in executing 256-bit AVX instructions. In the context of AVX, Bulldozer probably has equivalent cache throughput to Sandy Bridge, which was undoubtedly one of the major design targets. The L1D bandwidth is a substantial step forward for Bulldozer, as Istanbul was limited to either 2×128-bit loads or 2×64-bit stores or one of each.
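The peak bandwidth figures follow directly from the access widths. The Python sketch below converts each combination quoted above into bytes per cycle; it covers theoretical peaks only and ignores bank conflicts and the AGU limit. The helper name `bytes_per_cycle` is illustrative.

```python
def bytes_per_cycle(accesses) -> int:
    """Sum bandwidth over (count, width_in_bits) pairs issued in one cycle."""
    return sum(count * width_bits // 8 for count, width_bits in accesses)


# Bulldozer: two 128-bit loads plus one 128-bit store.
bulldozer_peak = bytes_per_cycle([(2, 128), (1, 128)])

# Istanbul: the three alternatives listed in the text.
istanbul_loads = bytes_per_cycle([(2, 128)])
istanbul_stores = bytes_per_cycle([(2, 64)])
istanbul_mixed = bytes_per_cycle([(1, 128), (1, 64)])

print(bulldozer_peak)   # 48
print(istanbul_loads)   # 32
print(istanbul_stores)  # 16
print(istanbul_mixed)   # 24
```

On these numbers, Bulldozer's peak is 1.5x Istanbul's best case, which is load-only.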
Bulldozer’s L2 cache is shared between the two cores in a module and is mostly inclusive of the L1D caches (recall that the L1D is write-through). The size is implementation dependent and early versions will be either 1MB or 2MB. Open64 compiler optimization notes indicate that Interlagos probably uses a 2MB, 16-way associative design. The load-to-use latency for Bulldozer’s L2 is surprisingly high: 18-20 cycles, again reflecting a focus on high frequency. In comparison, the L2 caches for Nehalem and Istanbul have latencies of roughly 10 cycles, although their capacities are smaller (256KB and 512KB respectively). The L2 cache can have as many as 23 outstanding misses concurrently, which is a somewhat peculiar number compared to the usual powers of 2. This suggests that some outstanding miss requests may be reserved for certain purposes. For example, there might be 8 misses outstanding for each L1D cache, with the remainder for use by the L1I cache and prefetchers.
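The speculative split of the 23 outstanding misses can be written out as a simple budget. In the Python sketch below, only the total of 23 comes from the disclosed figures; the reservation of 8 per L1D is the guess made above, and the function name is illustrative.

```python
TOTAL_L2_MISSES = 23  # outstanding misses supported by the L2, from the text


def remaining_miss_slots(total: int, per_l1d: int, l1d_count: int = 2) -> int:
    """Slots left for the L1I and prefetchers after a per-L1D reservation.

    per_l1d is a hypothetical reservation; AMD has not disclosed the split.
    """
    reserved = per_l1d * l1d_count
    if reserved > total:
        raise ValueError("reservation exceeds the total number of miss slots")
    return total - reserved


print(remaining_miss_slots(TOTAL_L2_MISSES, per_l1d=8))  # 7
```

Under that guess, 16 slots serve the two L1D caches and 7 are left over, which would account for the odd total.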