Memory Reordering and Stores
When a load stalls it is sent into the 6 entry reissue queue, where the memory re-ordering occurs. The most common stalls for loads are L1D misses, L1 DTLB misses, load to store aliasing where the forwarding data is unavailable, and misaligned accesses that touch two cache lines. Load operations in the reissue queue will wait until the stall condition has resolved (e.g., the cache line has arrived from the L2 cache). Once ready, loads can replay from the reissue queue out-of-order with respect to other loads. Intel did not disclose the number of fill buffers for the L1D, but the size of the reissue queue is an upper bound on the number of misses outstanding.
Equally important, the scheduler can continue to dispatch new operations even with stalled loads pending in the reissue queue. The 10 entry load queue maintains memory ordering for loads and preserves the appearance that loads complete in-order (a requirement of the x86 memory model). When a load completes out-of-order (e.g., it hits in the cache while an older load is in the reissue queue), it must allocate a load queue entry to hold the load address. The load queue essentially limits how many cache hits can occur in the shadow of a stalled load.
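The interaction between the two structures can be sketched with a toy model. This is an illustration under simplifying assumptions, not Intel's implementation: no miss resolves during the window being modeled, a hit only needs a load queue entry when an older load is already waiting in the reissue queue, and the function name is hypothetical.

```python
REISSUE_ENTRIES = 6      # reissue queue size stated in the article
LOAD_QUEUE_ENTRIES = 10  # load queue size stated in the article


def loads_dispatched_before_stall(outcomes):
    """Count loads that dispatch before either structure fills up.

    outcomes: "hit" or "miss" for each load, in program order.
    Assumption: no miss resolves while this window is in flight.
    """
    reissue_used = 0
    load_queue_used = 0
    dispatched = 0
    for outcome in outcomes:
        if outcome == "miss":
            if reissue_used == REISSUE_ENTRIES:
                break  # no room left to park another stalled load
            reissue_used += 1
        elif reissue_used > 0:
            # A hit that completes past an older stalled load must
            # record its address in the load queue.
            if load_queue_used == LOAD_QUEUE_ENTRIES:
                break  # no more hits allowed in the shadow of the stall
            load_queue_used += 1
        dispatched += 1
    return dispatched


# One stalled load followed by a long run of hits: the load queue is the limit.
print(loads_dispatched_before_stall(["miss"] + ["hit"] * 20))  # 11
```

In this model a single miss followed by hits allows 10 further loads to complete, while a run of pure misses stops after 6.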
Store operations are considerably more straightforward than loads. Stores issued into the back-end allocate an entry in the store data buffer and the memory scheduler. The store data buffer is composed of 16 entries that are 128 bits wide to accommodate SSE data; this limits the number of total stores in the back-end. Once a store is dispatched from the memory scheduler, it first calculates the virtual address in the AGU and translates it to a physical address in the L1 DTLB. When the translation is complete, the store takes an entry in the store address buffer to track the physical address and ensure correct memory ordering. There are 8 entries in the store address buffer, which limits the number of dispatched stores.
Any store with an unknown address blocks subsequent memory accesses. The most common cause is a store that misses in the L1 DTLB. In these cases, the store stays in the reservation station and will stall the memory pipeline. Older loads and stores in the pipeline can continue to proceed (e.g., replaying from the reissue queue), but younger ones must wait until the stalled store has resolved. Thankfully these cases are relatively rare, but they must still be handled correctly.
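The blocking rule can be expressed in a few lines. The sketch below is only a model of the ordering behavior described above; the `MemOp` type and function name are hypothetical, and real hardware tracks this per pipeline stage rather than over a list.

```python
from dataclasses import dataclass


@dataclass
class MemOp:
    name: str
    kind: str                   # "load" or "store"
    address_known: bool = True  # False models e.g. an L1 DTLB miss


def ops_allowed_to_proceed(ops):
    """Return names of memory ops (in program order) that may proceed.

    A store whose address is still unknown waits, and so does every
    younger memory operation behind it. Older operations are unaffected.
    """
    allowed = []
    for op in ops:
        if op.kind == "store" and not op.address_known:
            break  # this store and everything younger must wait
        allowed.append(op.name)
    return allowed


program = [
    MemOp("ld1", "load"),
    MemOp("st1", "store"),
    MemOp("st2", "store", address_known=False),  # stalled store
    MemOp("ld2", "load"),
]
print(ops_allowed_to_proceed(program))  # ['ld1', 'st1']
```

A core with memory disambiguation would instead predict whether `ld2` aliases `st2` and let it run ahead speculatively; this model has no such prediction.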
Once a store has dispatched and the data value is written into the store data buffer, it is mostly finished from the standpoint of the core. The store is ready to retire and the store buffer can forward data to dependent loads. To preserve correct memory ordering, the store will not complete and exit from the store buffer until the data is written to cache or memory. Silvermont and Saltwell both include 8 write combining buffers which merge multiple store operations to the same cache line into a single coherency transaction (which is particularly important for uncacheable stores), and accelerate stores that miss in the cache.
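To illustrate why write combining matters, the sketch below counts coherency transactions for a stream of stores. It is a simplified model with several assumptions that are not from the article: 64-byte cache lines, oldest-first eviction when all buffers are in use, and a final flush of every open buffer. The function name is hypothetical.

```python
LINE_BYTES = 64  # assumption: cache line size, not stated in this section
WC_BUFFERS = 8   # write combining buffers, per the article


def coherency_transactions(store_addresses):
    """Count line-sized transactions after merging stores per cache line.

    Assumption: when all buffers are busy, the oldest one is evicted
    (one transaction) to make room. Real eviction policy is not disclosed.
    """
    open_lines = []  # cache line numbers with an open buffer, oldest first
    transactions = 0
    for address in store_addresses:
        line = address // LINE_BYTES
        if line in open_lines:
            continue  # merged into an existing buffer, no new transaction
        if len(open_lines) == WC_BUFFERS:
            open_lines.pop(0)  # evict the oldest buffer
            transactions += 1
        open_lines.append(line)
    # Every buffer still open is eventually written out once.
    return transactions + len(open_lines)


# Eight 8-byte stores that fill one cache line collapse into one transaction.
print(coherency_transactions(range(0, 64, 8)))  # 1
```

Without merging, the same eight stores would each need their own transaction, which is especially costly for uncacheable memory.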
The L1D is backed by a large unified L2 cache that is shared by a pair of adjacent cores. Silvermont’s L2 cache is 1MB and 16 way associative, with a 13 or 14 cycle load to use latency and 32B/cycle bandwidth shared between both cores. In comparison, the minimum latency for the 512KB, 8-way L2 cache for Saltwell is 15 cycles. Since each Silvermont core can only have 6 data cache misses outstanding, it is possible for the memory pipeline to stall waiting for an L2 cache hit, but exceptionally unlikely. The L2 is non-inclusive and non-exclusive relative to the L1D and is a writeback design with full ECC protection and pseudo-LRU replacement. Both Silvermont and Saltwell include a simple next line prefetcher for the L1D. However, the L2 in Silvermont also includes a more sophisticated prefetcher. Intel did not disclose the number of misses outstanding from the shared L2 cache, but it is likely to be over 16 to account for load misses, instruction fetches and prefetching from both cores.
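A rough back-of-the-envelope check shows why such a stall is unlikely. With a fixed number of outstanding misses, sustained bandwidth is bounded by misses in flight times line size divided by latency (Little's law). The 64-byte line size is an assumption that this section does not state, and the function name is hypothetical; the other figures come from the paragraph above.

```python
L2_BANDWIDTH_BYTES_PER_CYCLE = 32  # shared by both cores, per the article
MAX_OUTSTANDING_MISSES = 6         # per-core limit, per the article
LINE_BYTES = 64                    # assumption: cache line size


def latency_bound_bandwidth(latency_cycles,
                            outstanding=MAX_OUTSTANDING_MISSES,
                            line_bytes=LINE_BYTES):
    """Bytes per cycle one core can pull from L2 if every miss slot stays
    full and every miss hits in the L2 (Little's law)."""
    return outstanding * line_bytes / latency_cycles


for latency in (13, 14):  # load to use latencies quoted for Silvermont
    bound = latency_bound_bandwidth(latency)
    print(f"{latency} cycles: {bound:.1f} B/cycle "
          f"(L2 peak {L2_BANDWIDTH_BYTES_PER_CYCLE} B/cycle)")
```

Under these assumptions, a core only becomes limited by its 6 miss slots when all of them are continuously occupied by L2 hits, which requires a sustained stream of back-to-back L1D misses.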
One of the big advantages of the Silvermont memory pipeline is that it sparingly uses expensive data structures. For example, there is no memory disambiguation, which requires a prediction structure that is checked by every load. This means that stores with unknown addresses stall the pipeline and cost performance, but it saves considerable power. The store address buffer and load queue are both snooped on a regular basis, which involves searching the entire structure for a match to an address. Minimizing the size of the load queue and store address buffer greatly reduces the energy wasted. Most out-of-order x86 cores use a load or store buffer entry for each memory access in the pipeline. In contrast, the Silvermont pipeline can theoretically have 16 stores in-flight with only 8 store address entries, and about 22 loads in-flight with only 10 load queue entries.