Memory Scheduling and Loads
Along with the new out-of-order microarchitecture in Silvermont, the memory pipeline has been reworked for much higher sustained performance. Intel's architects used a number of clever techniques to sustain high throughput to the caches with minimal power and area overhead. In particular, Silvermont still has a single memory pipeline and cannot dispatch multiple memory operations per cycle, which is a very power-efficient design point for x86. However, Silvermont can reorder loads to tolerate stalls and can satisfy multiple accesses simultaneously. Consequently, Silvermont uses the memory pipeline much more efficiently than Saltwell and comes much closer to achieving peak performance on a wide variety of code. At the same time, the overhead is substantially lower than that of a highly aggressive memory system like the one in Haswell.
Load and store operations are handled differently by Silvermont, reflecting the x86 memory ordering model. All memory accesses are allocated into the 6-entry memory scheduler shown in Figure 6, but they require different resources for tracking out-of-order execution, retirement and completion. Note that retirement refers to the point at which an instruction is finished in the pipeline, whereas completion means that a memory transaction is finished; the two are distinct.
The memory scheduler actually dispatches accesses in program order, but is generally a non-blocking design. The memory re-ordering comes later in the pipeline via the reissue queue. This simplifies the structure of the scheduler to a FIFO buffer, reduces the number of address comparisons, and avoids expensive speculative techniques such as memory disambiguation. In some cases, memory accesses that are not quite ready to execute are sent directly to the reissue queue, so that the scheduler can continue to make forward progress.
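The interaction between the in-order scheduler and the reissue queue can be sketched with a toy model. This is an illustration only, not a description of the actual hardware: the function name `simulate`, the per-operation `ready_cycle`, the rule that a ready parked operation takes the single pipeline slot first, and the unbounded queue sizes are all assumptions made for the example.

```python
from collections import deque

def simulate(ops):
    """Toy model: ops is a list of (name, ready_cycle) in program order.

    Returns the order in which operations finish. One operation uses the
    single memory pipeline per cycle (assumed, simplified).
    """
    scheduler = deque(ops)   # FIFO: dispatch strictly in program order
    reissue = []             # parked operations waiting out a stall
    done = []
    cycle = 0
    while scheduler or reissue:
        # Assumption: a parked operation that has become ready goes first.
        ready = [op for op in reissue if op[1] <= cycle]
        if ready:
            op = ready[0]
            reissue.remove(op)
            done.append(op[0])
        elif scheduler:
            op = scheduler.popleft()
            if op[1] <= cycle:
                done.append(op[0])
            else:
                # Not ready yet: park it so the FIFO keeps moving.
                reissue.append(op)
        cycle += 1
    return done

# "A" stalls until cycle 3, so the younger "B" and "C" finish first.
order = simulate([("A", 3), ("B", 0), ("C", 0)])
```

In this sketch the scheduler never compares addresses or picks among entries; all reordering falls out of parking stalled operations, which is the simplification the paragraph above describes.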
Once a load is issued from the instruction queue to the out-of-order machine, it allocates a destination rename register and a scheduler entry. When the load operation is dispatched from the scheduler, the first stop is the address generation unit (AGU), which calculates the virtual address. Afterwards, the virtual address is translated to a physical address in the data translation look-aside buffer (DTLB).
The Silvermont L1 DTLB is a fully associative structure with 48 entries that cache virtual-to-physical translations for 4KB pages. This is effectively 3× the size of the L1 DTLB in Saltwell, which uses a fully associative 16-entry TLB that is replicated for each thread. Misses in the L1 DTLB are serviced by the L2 DTLB, which is significantly larger in Silvermont; the structure is 4-way associative, with 128 entries for 4KB pages and 16 entries for large 2MB pages. In contrast, the L2 DTLB for Saltwell is also 4-way, but with only 64 entries for 4KB pages and 8 entries for 2MB pages.
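To put those entry counts in perspective, the amount of memory each TLB level can map (entries × page size) works out as follows. The entry counts and page sizes come from the text; the `reach` helper and variable names are hypothetical, introduced only for this sketch.

```python
KB = 1024
MB = 1024 * KB

def reach(entries, page_bytes):
    """Bytes of address space covered when every entry holds a valid translation."""
    return entries * page_bytes

# L1 DTLB, 4KB pages
silvermont_l1 = reach(48, 4 * KB)     # 192KB
saltwell_l1 = reach(16, 4 * KB)       # 64KB per thread

# L2 DTLB, 4KB and 2MB pages
silvermont_l2_4k = reach(128, 4 * KB) # 512KB
silvermont_l2_2m = reach(16, 2 * MB)  # 32MB
saltwell_l2_4k = reach(64, 4 * KB)    # 256KB
saltwell_l2_2m = reach(8, 2 * MB)     # 16MB

l1_ratio = silvermont_l1 // saltwell_l1
```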
The L1 data cache in Silvermont is fairly similar to the one in Saltwell. The data array is 24KB and 6-way associative with 64B cache lines. The cache is writeback, with a pseudo-random replacement policy. Each way holds 4KB (24KB / 6), which matches the minimum page size, so the cache index falls entirely within the page offset and the DTLB access and L1D tag check can occur in parallel. The L1D is physically implemented with 8T cells and uses parity protection rather than ECC. To sustain multiple accesses, it is organized into 16 banks that each read 4B of data per cycle. The L1 data cache can service a 16B load from the scheduler or the reissue queue and a 16B store from the store data buffers simultaneously. However, this throughput cannot be sustained consistently because only a single operation is dispatched per clock. Silvermont also substantially improves the performance of misaligned 128-bit accesses to native levels.
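The reason the tag check can overlap with translation follows from the cache geometry, and the arithmetic is short enough to check directly. The sizes below are the ones quoted in the text; the function and variable names are illustrative only.

```python
CACHE_BYTES = 24 * 1024   # L1D capacity
WAYS = 6                  # associativity
LINE_BYTES = 64           # cache line size
PAGE_BYTES = 4 * 1024     # minimum page size
BANKS = 16
BANK_READ_BYTES = 4

def l1d_geometry(cache_bytes=CACHE_BYTES, ways=WAYS, line_bytes=LINE_BYTES):
    """Return (sets, line offset bits, set index bits) for a set-associative cache."""
    sets = cache_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2 for a power of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits

sets, offset_bits, index_bits = l1d_geometry()
page_offset_bits = PAGE_BYTES.bit_length() - 1

# If offset + index bits fit inside the page offset, the set can be selected
# from untranslated address bits while the DTLB lookup is still in flight.
index_is_untranslated = offset_bits + index_bits <= page_offset_bits

# Total bytes the banks can read per cycle (one full cache line).
bank_bytes_per_cycle = BANKS * BANK_READ_BYTES
```

With 64 sets and 64B lines, the line offset and set index together use exactly the 12 bits of a 4KB page offset, so no translated bits are needed to pick the set.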
Load operations also probe the store buffer to detect any aliasing. If the load address matches a pending store, the store buffer can forward the data directly, bypassing the data cache and saving latency and power. Store forwarding is a major area of improvement for Silvermont, which can forward integer or FP data as long as the load and store have the same starting address and the store fully overlaps the load. In contrast, Saltwell can only forward data to integer instructions, not to SSE or x87.
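The forwarding condition stated above can be written as a small predicate. This is a simplified sketch of the rule as described in the text, not of the hardware's actual comparison logic; the `Access` type and `can_forward` function are hypothetical names, and any additional restrictions the real design may impose are not modeled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Access:
    addr: int   # starting byte address
    size: int   # access width in bytes

def can_forward(store: Access, load: Access) -> bool:
    """True if the store may forward to the load under the stated rule:
    same starting address, and the store fully covers the load."""
    same_start = store.addr == load.addr
    # With equal starting addresses, full overlap reduces to a size check.
    store_covers_load = store.size >= load.size
    return same_start and store_covers_load

wide_store_narrow_load = can_forward(Access(0x100, 8), Access(0x100, 4))
narrow_store_wide_load = can_forward(Access(0x100, 4), Access(0x100, 8))
offset_load = can_forward(Access(0x100, 8), Access(0x104, 4))
```

Here only the first case forwards: the second fails because the store does not cover the whole load, and the third fails because the load starts at a different address, even though its bytes lie inside the store.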