The zArchitecture memory modes include 24-bit, 31-bit and 64-bit virtual addressing, providing compatibility back to the 1960s. Virtual addresses are calculated with a three-way addition (base register + index register + displacement). Despite being a CISC architecture, IBM’s addressing is thankfully simple and easy to implement in comparison to many other ISAs.
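As a rough illustration of the three-way add, the sketch below sums the three components and wraps the result to the active addressing mode. The function and table names are hypothetical, and the simple truncation to 24, 31 or 64 bits is an assumption made for illustration; details such as displacement width and the special handling of particular registers are ignored.

```python
# Hypothetical helper: masks for the three addressing modes named in the text.
ADDRESS_MASKS = {
    24: (1 << 24) - 1,
    31: (1 << 31) - 1,
    64: (1 << 64) - 1,
}

def effective_address(base, index, displacement, mode=64):
    """Three-way add of base + index + displacement, wrapped to the mode width."""
    return (base + index + displacement) & ADDRESS_MASKS[mode]

# Example: a 64-bit mode access and a 24-bit mode access that wraps around.
addr64 = effective_address(0x1000, 0x20, 0x4)
addr24 = effective_address(0xFFFFFF, 0x1, 0x0, mode=24)
```

The only work on the critical path is a single addition, which is part of why this form of addressing is cheap to implement.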
For a microprocessor that is based so heavily on register-memory instructions, the memory pipeline is the most critical element of the microarchitecture. The z10 was fundamentally designed around minimizing the latency for dependent load operations accessing the L1 data cache. The z196 memory hierarchy is clearly descended from its predecessor, but with a host of improvements. The memory subsystems for the two designs are shown in Figure 5.
The z10 memory pipelines are fairly simple and start with address generation using dedicated AGUs. The dual-ported L1 data cache is 128KB and 8-way associative, with 256B lines. It is virtually indexed and physically tagged. The L1 DTLB is 2-way associative with 512 entries for 4KB pages. Segments are fragmented into individual 4KB pages in the L1 data and instruction TLBs. The virtual address is sent to the DTLB and, in parallel, probes the L1D using set prediction to avoid waiting for address translation. The L2 TLB services any translation misses and efficiently supports the larger 1MB segments. The L2 TLB maps segments with a 512-entry, 4-way associative array; 4KB page translations are held in a 1.5K-entry, 12-way array.
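The cache parameters quoted above imply some simple arithmetic, sketched below. The helper name and the returned field names are hypothetical, and the calculation assumes power-of-two sizes. It suggests why virtual indexing matters here: a 128KB, 8-way cache with 256B lines needs more index bits than a 4KB page offset provides.

```python
def cache_geometry(size_bytes, ways, line_bytes, page_bytes=4096):
    """Derive set count and index bits from cache parameters (powers of two assumed)."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1        # bits selecting a byte within a line
    index_bits = sets.bit_length() - 1               # bits selecting a set
    page_offset_bits = page_bytes.bit_length() - 1   # bits unchanged by translation
    return {
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        # Index bits that lie above the page offset and so depend on the virtual address.
        "virtual_index_bits": max(0, offset_bits + index_bits - page_offset_bits),
    }

l1d = cache_geometry(128 * 1024, ways=8, line_bytes=256)
```

With the z10 figures this gives 64 sets, with two of the six index bits sitting above the 4KB page offset.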
The load-to-use latency for the L1D cache is 4 cycles, including address generation, 2 cycles of cache and DTLB access and a cycle for formatting and returning the 64-bit data. The LSU forwards to the ALUs and AGUs with no penalty, due to the close coupling of the pipelines.
The z10 is in-order and any L1D miss will cause a replay. While all accesses are handled in program order, load look-ahead is allowed. Specifically, younger cache misses can proceed in the shadow of an older replayed miss to exploit limited parallelism. A total of 6 L1D misses can be in flight at a given time. The LSU also has hardware and software prefetching to hide latency. The hardware prefetcher is indexed by instruction addresses and can capture physical address strides of 11 bits in a 32-entry history buffer.
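A stride prefetcher of this general kind can be sketched as follows. Only the 32 entries, the instruction-address indexing and the 11-bit stride limit come from the description above; the direct-mapped table, the "same stride seen twice" confirmation rule and the class name are assumptions made for illustration, not IBM's design.

```python
class StridePrefetcher:
    """Toy stride prefetcher indexed by instruction address (illustrative only)."""

    def __init__(self, entries=32, stride_bits=11):
        self.entries = entries
        self.max_stride = (1 << stride_bits) - 1   # assumed: stride magnitude fits in 11 bits
        self.table = {}                            # slot -> (tag, last_address, last_stride)

    def access(self, pc, address):
        """Record an access; return a predicted prefetch address or None."""
        slot = pc % self.entries                   # assumed direct-mapped indexing
        entry = self.table.get(slot)
        prediction = None
        if entry is not None and entry[0] == pc:
            _, last_address, last_stride = entry
            stride = address - last_address
            # Predict only when the same non-zero stride repeats and is small enough.
            if stride != 0 and stride == last_stride and abs(stride) <= self.max_stride:
                prediction = address + stride
            self.table[slot] = (pc, address, stride)
        else:
            # New or conflicting instruction: start a fresh history entry.
            self.table[slot] = (pc, address, 0)
        return prediction

prefetcher = StridePrefetcher()
predictions = [prefetcher.access(0x400, a) for a in (0x1000, 0x1100, 0x1200)]
```

In this sketch the third access from the same instruction confirms the 0x100 stride and yields a prefetch for 0x1300, while strides too large for the 11-bit field never produce a prediction.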
The L1D cache is write-through (or store-through in IBM terminology) for reliability. The store queue has 10 entries for addresses, but since stores can be variable length, 64 double words (i.e. 512 bytes) of data can be buffered. The store queue and buffer can forward to subsequent loads, bypassing the cache entirely.
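Store forwarding can be illustrated with a small model. The 10-entry limit comes from the z10 description above; everything else (the class and method names, forwarding only when a single store fully covers the load, and giving up on partial overlap) is a simplifying assumption rather than a description of the real hardware.

```python
from collections import deque

class StoreQueue:
    """Toy store queue with store-to-load forwarding (illustrative only)."""

    def __init__(self, max_entries=10):
        self.max_entries = max_entries
        self.entries = deque()          # (address, data bytes), oldest first

    def push(self, address, data):
        """Buffer a store; return False if the queue is full and the store must stall."""
        if len(self.entries) >= self.max_entries:
            return False
        self.entries.append((address, bytes(data)))
        return True

    def forward(self, address, length):
        """Return bytes forwarded from the youngest overlapping store, or None."""
        for st_addr, st_data in reversed(self.entries):
            st_end = st_addr + len(st_data)
            overlaps = st_addr < address + length and address < st_end
            if not overlaps:
                continue
            if st_addr <= address and address + length <= st_end:
                offset = address - st_addr
                return st_data[offset:offset + length]
            return None                 # partial overlap: assumed not forwardable
        return None                     # no match: the load reads the cache instead

sq = StoreQueue()
sq.push(0x100, b"\x01\x02\x03\x04\x05\x06\x07\x08")
forwarded = sq.forward(0x102, 2)
```

Searching from the youngest entry backwards matters: a load must see the most recent older store to its address, not simply any matching one.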
Figure 5. z196 Memory Subsystem and Comparison
In comparison, the z196 is vastly more aggressive and has substantially more memory level parallelism. The schedulers can hold 40 memory accesses. The two load-store unit (LSU) pipelines are 8 cycles long, so a total of 56 memory operations (40 in the schedulers plus 8 in each pipeline) can be in flight at once. The out-of-order execution also means that loads and stores can execute as they are ready, rather than stalling on earlier operations. The LSU can freely re-order accesses, subject to the incredibly strong memory order model of the zArchitecture. In fact, zArchitecture is perhaps the only ISA with a stronger ordering model than x86.
The z196 executes loads as soon as they are ready, potentially bypassing stores with unknown addresses. To preserve program correctness, the pipeline is flushed if a load is moved ahead of an aliased store that is earlier in program order. The load and any later instructions then re-execute. The load is also marked as dependent on an earlier store, so that in the future it waits before executing and avoids further collisions. Intel and AMD refer to this technique as memory disambiguation, and a more detailed discussion is included in an earlier article on Intel’s Core microarchitecture.
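The flush-then-remember behavior described above can be modeled in a few lines. This is a minimal sketch under stated assumptions: the predictor is an unbounded set keyed by load instruction address, aliasing is an exact address match, and the names `DisambiguationPredictor` and `issue_load` are hypothetical. Real predictors use finite tables and must handle partial overlaps.

```python
class DisambiguationPredictor:
    """Remembers loads that were once caught bypassing an aliased store."""

    def __init__(self):
        self.must_wait = set()          # instruction addresses of marked loads

    def may_bypass(self, load_pc):
        return load_pc not in self.must_wait

    def record_violation(self, load_pc):
        self.must_wait.add(load_pc)

def issue_load(predictor, load_pc, load_address, unresolved_store_addresses):
    """Return 'wait', 'flush' or 'ok' for a load issued past unresolved older stores.

    unresolved_store_addresses holds the addresses those older stores turn out
    to have once they resolve (known to the model, not yet to the pipeline).
    """
    if not predictor.may_bypass(load_pc):
        return "wait"                   # marked load: hold until older stores resolve
    if load_address in unresolved_store_addresses:
        predictor.record_violation(load_pc)
        return "flush"                  # load ran too early: flush and re-execute
    return "ok"                         # speculation succeeded

predictor = DisambiguationPredictor()
first = issue_load(predictor, load_pc=0x40, load_address=0x2000,
                   unresolved_store_addresses={0x2000})
second = issue_load(predictor, load_pc=0x40, load_address=0x2000,
                    unresolved_store_addresses={0x2000})
```

The first encounter pays the flush penalty, and every later execution of the same load trades a little latency for avoiding a repeat flush.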
The z196 L1 data cache has the same organization and 4-cycle load-to-use latency as the z10, although the total load-store pipeline is 7 cycles deep. However, the z196 DTLB has been substantially redesigned for higher performance. The L1 DTLB contains two translation arrays, with 512 entries for 4KB pages and 64 entries for 1MB segments; both are 2-way set associative. This is backed up by a massive L2 TLB that includes a 512-entry, 4-way associative segment array and a 3K-entry, 12-way page array. The number of outstanding L1D cache misses stayed at 6. The z196 store queue can hold 16 different addresses and up to 96 double words (768 bytes) of data, and it also handles store forwarding.
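One way to put these TLB sizes in perspective is to compute their reach, meaning the amount of memory that can be mapped without a miss. The sketch below is plain arithmetic on the figures quoted above; the helper name is hypothetical and "3K" is assumed to mean 3072 entries.

```python
KB = 1024
MB = 1024 * KB

def tlb_reach(entries, page_bytes):
    """Memory covered when every entry maps a distinct page or segment."""
    return entries * page_bytes

z196_reach = {
    "L1 DTLB, 4KB pages": tlb_reach(512, 4 * KB),
    "L1 DTLB, 1MB segments": tlb_reach(64, 1 * MB),
    "L2 TLB, 4KB pages": tlb_reach(3 * 1024, 4 * KB),   # assumes 3K = 3072 entries
    "L2 TLB, 1MB segments": tlb_reach(512, 1 * MB),
}
```

On these numbers the 64-entry segment array in the L1 DTLB covers 64MB, against 2MB for the 512-entry page array, which shows how much the native 1MB segment entries extend the first-level reach.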
The L2 caches for the z10 and z196 are relatively similar (although the z10 is often described as having an L1.5 cache). The L2 cache in the z10 is 3MB and 12-way associative. It operates at the nest frequency (half the core frequency) and has an average latency of 14.5 cycles or 3.3ns. The L2 cache is organized into 256B lines and partitioned into two slices. Each slice can independently access and transmit 16B/cycle to the L1, for a sustained read bandwidth of 32B/cycle. The L2 cache is inclusive of the L1I, L1D and coprocessor caches and protected by ECC. It is also write-through, meaning that stores write to the L1D, the L2 and the external L3 cache. There is also a second level store queue that buffers store-through traffic to the L2 cache and a dedicated write bus that is 16B/cycle for long stores and 8B/cycle for other stores.
The L2 cache in the z196 is 1.5MB and 12-way associative. The capacity was reduced from the previous generation to keep the average latency at 14 processor cycles (i.e. 2.7ns) and also to free up die area for the shared L3 cache. The L2 is a single partition with 32B/cycle of bandwidth, which effectively halves the latency of transmitting a 256B cache line to 8 nest cycles. The write bandwidth from the L1D cache is the same as in the prior generation and the L2 is still an inclusive and write-through design. The L2 has ECC protection, although it is only used for error detection. Any error correction is done using the L3 cache.
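The L2 figures for both generations can be cross-checked with a little arithmetic. The sketch below uses only numbers quoted in the text; the helper names are hypothetical, and the z10 line transfer assumes that a single line is served by one 16B/cycle slice.

```python
def implied_clock_ghz(latency_cycles, latency_ns):
    """Core clock implied by a latency quoted in both cycles and nanoseconds."""
    return latency_cycles / latency_ns

def line_transfer_cycles(line_bytes, bytes_per_cycle):
    """Nest cycles needed to move one cache line over the L2-to-L1 bus."""
    return line_bytes // bytes_per_cycle

z10_clock = implied_clock_ghz(14.5, 3.3)        # roughly 4.4 GHz
z196_clock = implied_clock_ghz(14, 2.7)         # roughly 5.2 GHz

z10_transfer = line_transfer_cycles(256, 16)    # one slice at 16B/cycle (assumed)
z196_transfer = line_transfer_cycles(256, 32)   # single partition at 32B/cycle
```

The transfer time drops from 16 to 8 nest cycles, matching the halving described above, even though the aggregate read bandwidth is 32B/cycle in both designs.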