The load-store pipelines are the most complicated part of Itanium microprocessors. All the resources that were saved with VLIW and static scheduling were poured into an incredibly high-performance cache hierarchy. To some extent, the cache hierarchy was over-designed for workloads outside of technical computing. Few server workloads fully utilize Tukwila’s caches, with the possible exception of analytic databases.
Tukwila has an incredible 4 load/store pipelines tightly integrated with the cache and TLB hierarchy to achieve low latency and high bandwidth. The L1D cache and L1 DTLB are only used for integer load instructions, while all stores and floating point loads rely on the L2 D-cache. This is a great example of microarchitecture and circuit co-design with impressive results. The overall cache system is quad-ported, with single cycle latency for integer loads and high bandwidth for floating point data accesses. Only the first two of the memory pipelines can access the L1 D-cache, although they can also issue FP loads to the L2D. The second set of memory pipelines is specialized for integer stores and any FP memory accesses; these pipelines generally interface with the L2D.
Figure 6 – Poulson Memory Subsystem and Comparison
The two load AGUs calculate an address using register-indirect with optional post-increment. Integer loads access three structures in parallel: the dual-ported L1 DTLB for address translation, the L2 DTLB for checking access privileges and the L1D for data. The L1 DTLB is only probed for integer loads; each entry translates a 4KB page (or a sub-page of an 8KB/16KB page). The larger quad-ported L2 DTLB holds translations and protection information for any page size (4KB to 4GB) and access type. Software can manage half of the entries in either DTLB.
The latency optimized L1D cache is 16KB, 4-way associative with 2 read ports and 2 write ports. Cache lines are 64B, with a write-through and write no-allocate policy. Any stores must also write to the L2D, and store misses only write to the L2D – thus the cache only has valid and invalid coherency states. The cache is virtually indexed and uses a pre-validated tag instead of a conventional physical tag. Integer loads take an amazingly low single cycle to execute, including address generation, translation and data cache access.
The second set of AGUs is specialized for FP loads and all stores; it directly accesses the L2 DTLB and L2 D-cache. After translation, any FP loads, stores or L1D misses are sent to the L2D. Integer stores also write the 24-entry store buffer, which drains into the L1D. The L2D is out-of-order with a 32-entry queue for outstanding cache accesses. The L2D queue can initiate 4 memory requests per cycle to the L2 D-cache arrays, and the queue can simultaneously fill into the L1D, the FP register file and write to the data arrays.
The L2 D-cache is 256KB, 8-way associative with larger 128B lines and is neither exclusive nor inclusive of the L1D. The minimum latency is 5 cycles for integer loads, with an extra cycle for FP loads to format the data. It is ECC protected and write-back to the L3. The L2D is pseudo quad-ported: the data array has 16 banks, and each bank can read or write 16B per cycle. At peak bandwidth, the L2D arrays read 64B for loads, 128B for filling the L1D and write 64B for stores. The L2D has full MESI coherency semantics, as does the unified L3.
The Itanium ISA also includes a special structure called the Advanced Load Address Table (ALAT). A speculative load creates an entry in the ALAT (Tukwila has a 32-entry ALAT per thread), which is checked when the load data is consumed. All stores check the ALAT for conflicts. If no conflict is detected, the data comes from the ALAT with 0 cycle latency – otherwise, the load is re-issued. In essence, this is a software controlled version of the Core 2’s memory disambiguation.
Poulson’s cache hierarchy was glossed over at ISSCC and remains somewhat of a mystery, although Figure 6 shows some information. There are only two AGUs and memory pipelines in Poulson, but these can probably issue any memory operation. The L1 and L2 DTLBs have been replicated, presumably with a copy per thread. The L2 DTLB may have as few as two ports, to match Poulson’s two AGUs, although depending on the replay mechanism, there might be reasons to maintain the full 4 ports. Duplicating the DTLBs reduces contention and substantially improves throughput – especially for server workloads with a huge data footprint. The L1 D-cache is still 16KB, 4-way associative with single cycle latency and two load plus two store ports.
Poulson’s L2 D-cache has been completely rearchitected to improve performance, although the details are scarce. Since there are two threads simultaneously sharing the L1D, the number of memory accesses in flight potentially doubles (up to 64) and the L1D miss rate will substantially increase. The L2 D-cache in Poulson is still heavily banked and now uses 64B lines with a slightly longer 8 cycle minimum latency. Filling an L1D miss takes half the bandwidth and fewer cycles versus Tukwila because of the smaller cache lines. The L2D ordering queue and store buffer will be expanded to handle more memory accesses. Moreover, the L2D will sustain higher bandwidth into the core and L1D cache, to satisfy requests from both threads. It would not be surprising to see the L2D ordering queue and cache arrays double the fill rate into the L1D to 64B/cycle. The bandwidth to the FP register file might increase as well, although this seems less critical. Additionally, Poulson sports a hardware data prefetcher that probably brings data into the L2D, but the details are unknown. The ALAT is replicated per thread, and is likely the same size as Tukwila’s but with fewer ports.
There is a theory that Poulson’s L1 and L2 data caches have been replicated for each thread, which would improve throughput substantially by eliminating contention. The data caches appear unusually large in die photos when compared to the instruction caches. However, this is an outside chance at best, and subsequent presentations at Hot Chips should reveal the truth of the matter.
On paper it seems like Poulson has a weaker cache hierarchy, with only 2 memory pipelines instead of 4. However, that is a simple analysis which does not take into account Poulson’s dynamic scheduling. Tukwila was specifically designed with 4 memory pipelines so that two bundles with 4 memory instructions could be simultaneously decoded, issued and executed. However, it is very uncommon for server workloads to need that level of sustained performance. Poulson’s new instruction buffers can readily decode 4 memory instructions per cycle. The dynamic scheduling in the memory queue can then issue accesses over several cycles without pipeline stalls. The only case where performance would decrease is for carefully tuned loops that sustain 4 accesses per cycle, like those in scientific computing (which is no longer a target market). The more sophisticated issue logic in Poulson actually simplifies other parts of the pipeline, and highlights a key benefit of dynamic scheduling and out-of-order execution.
One of the other big changes in Poulson is the L3 cache, but that discussion properly belongs in the system interface section.