The memory hierarchy for Jaguar is conceptually similar to Bobcat. Both feature two pipelines, one for loads and one for stores, with aggressive re-ordering of memory operations. The basic load-to-use latency for the L1D cache is 3 cycles, comprising one cycle for address generation and two cycles for cache access. Jaguar’s bandwidth to the L1 data cache is doubled compared to Bobcat, corresponding to the increase in performance in the FP/SIMD cluster. Jaguar also includes a number of enhancements to store forwarding.
The memory pipeline is out-of-order to the extent that the x86 memory ordering model permits. At dispatch, any µop accessing memory starts by allocating an entry in the AGU scheduler and the memory ordering queue (MOQ); store µops also reserve an entry in the store queue. The MOQ and store queue are collectively responsible for enforcing correct memory ordering. Bobcat and Jaguar both reorder aggressively: loads can pass earlier loads and earlier non-aliasing stores. To move loads ahead of stores with unknown addresses, a predictor is used, similar to the memory disambiguation in AMD’s Bulldozer or Intel’s CPU cores (e.g., Sandy Bridge).
The AGU scheduler holds 12 entries in Jaguar and 8 in Bobcat. The Jaguar MOQ is 16 entries, compared to 10 entries for Bobcat. The Jaguar store queue holds 20 entries that are each 16B wide, compared to 22 narrower 8B entries for Bobcat. In practice, the microarchitecture of the Jaguar store queue is much more flexible, which more than makes up for the slightly smaller number of entries.
The oldest ready load and store µops are sent to the respective AGUs for address calculation in the first cycle of the memory pipeline. Virtual addresses for x86 are quite complex and in theory require 4 inputs, 3 additions, and a multiplication: segment_register + base_register + (index * scale) + 32b_offset. In practice, the segment register is only used by VMware and a non-zero base register is uncommon. The AGUs in Bobcat and Jaguar are optimized for simpler addressing and impose a one cycle penalty if either the segment or base registers are non-zero, suggesting that the AGU contains a single adder and multiplier.
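The addressing formula above can be sketched in a few lines. This is a minimal illustration of the arithmetic, not a hardware interface; the function name and 32-bit wrap-around are assumptions for the example.

```python
# Sketch of x86 effective-address generation: segment base + base register
# + (index * scale) + displacement. Illustrative only.
def effective_address(seg_base, base, index, scale, disp):
    """Compute a 32-bit effective address; scale must be 1, 2, 4, or 8."""
    assert scale in (1, 2, 4, 8)
    return (seg_base + base + index * scale + disp) & 0xFFFFFFFF  # 32-bit wrap

# Common simple case: flat segment (base 0), no index register.
simple = effective_address(0, 0x1000, 0, 1, 0x20)    # 0x1020
# Complex case needing the multiplier: scaled index of 3 * 4.
complex_ = effective_address(0, 0x1000, 3, 4, 0x0)   # 0x100C
```

The simple case needs only one addition, which is consistent with the one-cycle penalty the article describes when the segment or base contributions force extra work.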
Once the virtual address has been calculated, it is written into the memory ordering queue (MOQ). Memory µops then access the data cache and check the store queue for memory ordering constraints. The L1 data cache is 32KB and 8-way associative with 64B cache lines; each way is exactly one 4KB page, so the set index falls entirely within the page offset and only the tag requires address translation. The writeback cache is parity protected and uses a pseudo-LRU replacement algorithm. Jaguar’s L1D can sustain a 128-bit read and a 128-bit write each cycle, doubling the bandwidth of Bobcat and also avoiding certain bank conflict scenarios.
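The arithmetic behind that virtually-indexed organization is worth spelling out. A quick sketch of why the stated geometry allows indexing before translation completes:

```python
# Why the 32KB, 8-way L1D with 64B lines can be indexed with the virtual
# address: each way is one 4KB page, so the index bits never need translation.
CACHE_SIZE = 32 * 1024   # bytes
WAYS = 8
LINE = 64                # bytes
PAGE_SIZE = 4096         # 4KB pages

way_size = CACHE_SIZE // WAYS          # 4096 bytes: one page per way
sets = way_size // LINE                # 64 sets
offset_bits = LINE.bit_length() - 1    # 6 bits of byte offset
index_bits = sets.bit_length() - 1     # 6 bits of set index

# offset + index = 12 bits, exactly the 4KB page offset, which is
# identical in the virtual and physical address.
assert offset_bits + index_bits == PAGE_SIZE.bit_length() - 1
```

With this geometry the tag lookup is the only step that depends on the TLB, which is why the tag check can proceed in parallel with translation in the next pipeline stage.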
The actual data cache access takes two clock cycles. In the first cycle, the virtual address is translated in parallel with checking the data cache tags for a hit. The L1 DTLB is fully associative and holds 40 entries for 4KB pages and 8 entries for larger 2MB pages. While 1GB pages are supported, they are fractured into 2MB pages. The L2 DTLB includes a 512 entry, 4-way associative array for 4KB pages and a 256 entry, 2-way associative array for 2MB pages. L2 TLB misses are speculatively resolved by a hardware page table walker that includes a 16-entry page directory cache for intermediate translations. Stores exit the MOQ once the physical address is available and has been written into the store queue.
If the tag check determines that a hit has occurred in any of the 8 ways of the L1D, the second cycle is used to read out the data and format it appropriately. Once the data from a load has been delivered to registers or the forwarding network, the load µop has completed and may exit the MOQ to wait in a separate queue until retirement. The load-to-use latency is 3 cycles, but the bypass network adds two extra cycles of latency for transmitting data to the FP cluster. Misaligned loads within a 16B boundary generate a single cache access, whereas those crossing a 16B boundary suffer at least one extra cycle of latency and half throughput. Unaligned load instructions (e.g., MOVUPD) to aligned data suffer no latency penalty.
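The 16B-boundary condition is simple to state precisely. The helper below is hypothetical, just a sketch of the check that separates the single-access case from the slower boundary-crossing case:

```python
# Does a load of `size` bytes starting at `addr` span a 16B boundary?
# Loads that do require two cache accesses per the description above.
def crosses_16b_boundary(addr, size):
    first_chunk = addr // 16
    last_chunk = (addr + size - 1) // 16
    return first_chunk != last_chunk

crosses_16b_boundary(0x100E, 4)   # True: bytes 0x100E-0x1011 span 0x1010
crosses_16b_boundary(0x1008, 8)   # False: contained within 0x1000-0x100F
```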
Loads check the store queue in parallel to ensure that the x86 ordering model is observed and resolve any store forwarding. Store forwarding is possible for aligned loads and stores that start on the same byte, where the store is at least as large as the load (e.g., an 8B store and 1-8B load; but not an 8B store and a 16B load).
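The forwarding eligibility rule stated above (same starting byte, store at least as large as the load) can be expressed directly. This is an illustration of the rule, not the hardware's actual matching logic:

```python
# Can a store in the store queue forward its data to a younger load?
# Per the rule described: the load must start on the same byte as the
# store and be no larger than it. Sizes are in bytes.
def can_forward(store_addr, store_size, load_addr, load_size):
    return load_addr == store_addr and load_size <= store_size

can_forward(0x2000, 8, 0x2000, 4)    # True: 4B load covered by 8B store
can_forward(0x2000, 8, 0x2000, 16)   # False: 16B load exceeds 8B store
can_forward(0x2000, 8, 0x2004, 4)    # False: different starting byte
```

Loads that fail this check but still overlap a pending store must instead wait for the store data to reach the cache, which is considerably slower.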
In the x86 memory model, stores commit in program order by writing data to the L1D, or directly to memory for uncacheable data. Store data is held and buffered in the store queue until all earlier instructions have retired. Once a store has committed its data, the associated store queue entry is freed.
All first level cache misses and prefetches are directed to the L2 cache, which was entirely redesigned between Bobcat and Jaguar. At a high level, the biggest difference is that the L2 cache is per-core for Bobcat, whereas Jaguar uses a much more modern and elegant shared L2 cache design.
Each Bobcat core can sustain 8 outstanding cache miss transactions or prefetch requests. The prefetcher tracks up to 8 different request streams. The instruction cache can generate 2 demand or prefetch requests. The number of outstanding misses from each Jaguar core is relatively similar, except that the instruction cache can generate 3, rather than 2 requests.
The Bobcat L2 cache is a per-core 512KB structure, with 16-way associativity and operates at half the processor frequency. The minimum load-to-use latency is 17 cycles for a hit. The L2 is a writeback design with ECC protection on tags and data. Any L2 misses fill into the original requesting cache as well as the L2, thus it is mostly (but not strictly) inclusive. The L2 complex includes the bus interface to memory that can track 4 outstanding requests, as well as 4 write combining buffers for uncacheable writes.
The Jaguar L2 cache microarchitecture is substantially different and much more efficient at the system level. A group of four cores shares a single L2 cache control block that controls 4 banks of L2 data arrays and acts as the central point of coherency and interface to the rest of the system. For systems with more than four cores, each local cluster must be connected through a fabric.
The Jaguar shared L2 cache is 1-2MB and 16-way associative with ECC protection for data and tag arrays. The minimum hit latency is 25 cycles, which is worse than Bobcat, but the hit rate is substantially higher since the capacity is up to 4× greater. The L2 uses a writeback policy and is inclusive of all 8 lower level caches (4 data caches and 4 instruction caches), thereby acting as a snoop filter. The L2 cache is implemented with a single shared interface block and 4 replicated data arrays.
The L2 cache interface operates at core frequency and includes the control logic, tag arrays, and bus interfaces to the cores and the rest of the system. The L2 controller can track 24 transactions, each pairing a read and a write. The L2 incorporates a prefetcher for each core and a 16 entry snoop queue for coherency requests. The L2 tags are split into 4 banks with addresses hashed across the banks. Each tag array is associated with a data array, so an L2 tag hit will only check a single bank of the data array for the cache line. Because only the addressed bank is activated, this hashing reduces the clocking of the tag and data arrays, thereby decreasing power consumption. The data arrays are 512KB each and run at half core frequency to save power. Each data array delivers 16B per core clock cycle, for an aggregate bandwidth of 64B/cycle, matching the throughput of Bobcat’s private caches with a substantially higher hit rate.
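The banked organization above can be summarized numerically. The bank-selection hash below is purely illustrative (the actual hash function is not documented); the bandwidth arithmetic follows directly from the figures in the text:

```python
# Sketch of the banked Jaguar L2: addresses hash across 4 tag/data banks,
# so a hit activates only one 512KB array. Hash choice is an assumption.
BANKS = 4
LINE = 64  # bytes per cache line

def l2_bank(phys_addr):
    """Pick a bank from the cache-line address (illustrative hash)."""
    return (phys_addr // LINE) % BANKS

# Aggregate bandwidth: each half-speed array delivers 16B per core cycle.
bandwidth = BANKS * 16   # 64B per core cycle across the four arrays
```

Consecutive cache lines land in different banks under this scheme, which spreads accesses from the four cores across the arrays.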