While the microprocessor cores changed dramatically in the latest generation of mainframes, there were other significant improvements to the z196. Some of the biggest differences lie in the cache and memory hierarchy outside of the CPU cores. This article will not delve deeply into the overall mainframe system architecture, but will instead focus on the outer levels of the cache hierarchy.
The z10 system architecture includes three levels of caches. The L1 and L2 are on-die and private per-core, while the L3 cache is off-chip and shared. Each MCM contains 5 processors and 2 storage controllers (SCs). The memory controllers and GX++ I/O controllers are integrated into the z10 die itself. However, the L3 cache and external coherency interfaces are implemented in the SCs.
The L3 is 48MB, 24-way associative, write-back and fully inclusive of the L1 and L2 caches. Data is address partitioned in the on-die L2 caches and in the two SCs, so that each slice of the L2 cache is paired with one SC and the address partitions can be independently managed. The 24MB partition in each SC has two pipelines, so the entire L3 cache can serve 4 accesses simultaneously. The z10 CPUs have two bi-directional links to connect with the SCs in the MCM. These links run at 2/3 the core frequency and can simultaneously send and receive 8B per cycle, so each z10 processor has an aggregate bandwidth of 46.9GB/s for reads and 46.9GB/s for writes to the two SCs.
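The 46.9GB/s figure falls out of the link parameters quoted above. A quick sketch of the arithmetic, assuming the z10's 4.4GHz core clock:

```python
# Back-of-the-envelope check of the z10 processor-to-SC link bandwidth.
# Link width, link count and clock ratio are from the text; the 4.4GHz
# core clock is the z10's published frequency.

CORE_CLOCK_HZ = 4.4e9                    # z10 core frequency
LINK_CLOCK_HZ = CORE_CLOCK_HZ * 2 / 3    # links run at 2/3 core frequency
LINK_WIDTH_BYTES = 8                     # 8B per cycle, each direction
NUM_LINKS = 2                            # two bi-directional links per processor

read_bw_gbs = NUM_LINKS * LINK_WIDTH_BYTES * LINK_CLOCK_HZ / 1e9
print(f"{read_bw_gbs:.1f} GB/s per direction")  # → 46.9 GB/s per direction
```

Because the links are bi-directional, the same figure applies independently to reads and writes.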
The SC also manages coherent communication with the local processors, memory and other MCMs. Coherency is enforced using an optimized variant of the MOESI protocol with 15 states, including several for I/O. The z10 introduced snoop filters to the L3 cache, which track whether any local L1s or L2s are holding a given line, reducing unnecessary snoop traffic. The SCs have 3 bi-directional coherency links that connect to other MCMs. The links are 16B wide and run at one third the core frequency, for a total bandwidth of 70.3GB/s per SC.
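The general technique behind the z10's snoop filter is a presence vector: the L3 records which local caches may hold each line, so a request only generates snoops to those caches. The sketch below is a software analogy of that idea, not the actual hardware design:

```python
# Illustrative presence-vector snoop filter: track which cores may hold a
# line so that a request need not broadcast snoops to every cache.
# This is a conceptual sketch, not the z10's hardware implementation.

class SnoopFilter:
    def __init__(self):
        self.presence = {}  # line address -> set of core ids that may hold it

    def record_fill(self, line_addr, core_id):
        # Called when a core's L1/L2 acquires a copy of the line.
        self.presence.setdefault(line_addr, set()).add(core_id)

    def record_evict(self, line_addr, core_id):
        cores = self.presence.get(line_addr)
        if cores:
            cores.discard(core_id)
            if not cores:
                del self.presence[line_addr]

    def snoop_targets(self, line_addr, requester):
        # Only cores recorded as possible holders need to be snooped.
        return self.presence.get(line_addr, set()) - {requester}

sf = SnoopFilter()
sf.record_fill(0x1000, core_id=0)
sf.record_fill(0x1000, core_id=2)
print(sf.snoop_targets(0x1000, requester=0))  # → {2}
print(sf.snoop_targets(0x2000, requester=1))  # → set()
```

A line absent from the filter needs no snoops at all, which is where the traffic savings come from.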
The z196 brings the massive L3 cache on-die and into the microprocessor, taking advantage of IBM’s unique deep trench capacitor eDRAM. This has significant implications for performance because it eliminates off-die write-through traffic, which is a tremendous performance bottleneck. The z196 L3 cache is 24MB and 12-way associative, shared by all four cores with a uniform 40 cycle (7.7ns) read latency. The L3 cache is inclusive of all lower levels (L1, L2 and co-processor) and is write-back. The L3 is implemented in eDRAM and operates at one quarter the core frequency, or half the nest frequency.
The L3 cache is designed for massive bandwidth, since it must service all L2 misses from the four cores and all writes in the system. There are two independent 12MB slices of the L3, and the 256B cache lines are hashed between the slices. Each slice contains 48K cache lines that are organized into 12 ways of 4K lines. The slices are further banked for bandwidth and cache lines are partitioned into 8 sectors of 32B. A slice contains 8 interleaves, which are 32B wide and hold one sector for each of the 4K cache lines. A cache line read starts at the critical sector first and then sequentially accesses the remaining 7 interleaves. Since the L3 handles all stores, it must have even higher concurrency. Each interleave is actually split into two banks (or sub-interleaves) of 2K sectors, which can service concurrent accesses. Each sub-interleave is physically implemented as a 64KB eDRAM macro, so the entire L3 cache requires 384 macros.
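The structure above amounts to a decomposition of each physical address into slice, set and sector fields. The sketch below illustrates one plausible decomposition; the field positions and the slice hash are assumptions for illustration, since the text only specifies the sizes (256B lines, 2 slices, 4K sets per slice, 8 sectors of 32B):

```python
# Hypothetical decomposition of a physical address into the z196 L3's
# structural units. The slice hash (low line-address bit) and the bit
# positions are illustrative assumptions, not the documented hardware hash.

LINE_BYTES = 256      # z196 cache line size
SECTOR_BYTES = 32     # 8 sectors per line, one per interleave
SETS_PER_SLICE = 4096 # 48K lines / 12 ways

def decode(addr):
    sector = (addr % LINE_BYTES) // SECTOR_BYTES  # which 32B sector/interleave
    line_addr = addr // LINE_BYTES
    slice_id = line_addr & 1                      # assumed hash between 2 slices
    set_index = (line_addr >> 1) % SETS_PER_SLICE # assumed set-index field
    return slice_id, set_index, sector

print(decode(0x12345))  # → (1, 145, 2)
```

A critical-sector-first read would start at the returned sector index and walk the remaining 7 interleaves in sequence.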
While eDRAM is more resilient to soft errors and denser than SRAM, that comes at a cost. The eDRAM for the z196 can only service a single read access every 12 processor cycles. This restriction may also be related to all the store traffic that the L3 must absorb, due to the write-through L1 and L2 caches. In aggregate, the L3 cache read bandwidth across the 16 interleaves is 512B every 12 cycles, or 221.9GB/s.
The L3 slices in the z196 have dedicated 16B read and 16B write buses to communicate with each of the four private L2 caches, for a total of 128B read and 128B write. These buses operate at the nest clock, which is half the core frequency and works nicely with the interleaves in the L3 cache. Reading out a 32B interleave takes two nest cycles, and it is transmitted to the L2 in two nest cycles as well. Accordingly, transferring a cache line from the L3 to the L2 takes 16 nest cycles (or 32 core cycles), and cannot be interrupted.
The actual read bandwidth from the L3 to the cores is 332.8GB/s, which is 50% more than the L3 data arrays can provide. The extra communication bandwidth avoids contention between data fetched from the L3 cache and data fetched from the external L4 cache or memory. The write bandwidth is similarly large, but is necessary to absorb the store traffic from the four cores.
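The figures quoted in the last few paragraphs are mutually consistent, as a quick check shows, assuming the z196's 5.2GHz core clock (and thus a 2.6GHz nest clock):

```python
# Cross-checking the z196 L3 bandwidth and latency figures.
CORE_HZ = 5.2e9        # z196 core clock
NEST_HZ = CORE_HZ / 2  # nest clock is half the core clock

# eDRAM array read bandwidth: 16 interleaves x 32B, one access per 12 core cycles
array_bw = 16 * 32 * CORE_HZ / 12 / 1e9
# L3-to-L2 bus bandwidth: 2 slices x 4 cores x 16B read buses at the nest clock
bus_bw = 2 * 4 * 16 * NEST_HZ / 1e9

print(f"array: {array_bw:.1f} GB/s")       # → array: 221.9 GB/s
print(f"buses: {bus_bw:.1f} GB/s")         # → buses: 332.8 GB/s
print(f"ratio: {bus_bw / array_bw:.2f}x")  # → ratio: 1.50x

# L3 load-to-use latency: 40 core cycles
print(f"latency: {40 / CORE_HZ * 1e9:.1f} ns")  # → latency: 7.7 ns
```

The 1.5x ratio is the 50% headroom described above, which leaves bus cycles free for fills arriving from the L4 cache and memory.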
The MCM for the z196 contains 6 microprocessors (24 cores total) and an inclusive eDRAM L4 cache that is implemented in two SCs. The L4 cache is 192MB, 24-way associative and extensively banked. The SCs also contain 3 bi-directional coherency links, and systems can contain 1-4 MCMs. The z196 incorporates the GX++ I/O fabric and 5 channels of ECC protected DDR3 memory. A new reliability feature in the z196 is that the fifth channel is dedicated to parity, conceptually akin to RAID for memory; IBM calls this RAIM, a Redundant Array of Independent Memory. In a given MCM, only 3 of the 6 microprocessors have active memory controllers, so there is a total of 12 data channels available. Note that all memory requests are initiated by the SC, due to the strong memory ordering model in z/Architecture. As a result, while the memory controllers are integrated into the z196, integration does not confer the same latency advantage found in other architectures.
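The RAID-like role of the fifth channel can be sketched with simple XOR parity: the parity channel stores the XOR of the four data channels, so the contents of any single failed channel can be reconstructed from the survivors. The channel layout and word values below are purely illustrative, not the z196's actual encoding:

```python
# Illustrative XOR parity across memory channels, the basic idea behind
# RAID-style memory protection. Values and layout are made up for the demo.

def parity(words):
    p = 0
    for w in words:
        p ^= w
    return p

data_channels = [0xDEAD, 0xBEEF, 0x1234, 0x5678]
parity_channel = parity(data_channels)

# Simulate losing channel 2 and rebuilding it from the survivors plus parity.
survivors = [w for i, w in enumerate(data_channels) if i != 2]
rebuilt = parity(survivors + [parity_channel])
print(hex(rebuilt))  # → 0x1234
```

Because XOR is its own inverse, the XOR of the three surviving channels and the parity channel yields exactly the lost channel's data.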