The z196 is a return to the earlier mainframe days. The S/360 Model 91 FPU was the first to implement Tomasulo’s eponymous algorithm and later models were fully out-of-order. Ironically, the first 4 generations of 64-bit mainframes were all high frequency, in-order processors, including the most recent z10. The z10 was a cousin to the POWER6, and shared many building blocks and overall architectural directions. One of these shared design points was an in-order integer core aimed at frequencies over 4GHz. In fact, the z10 was totally in-order (whereas the POWER6 had out-of-order floating point), which is logical given that integer code matters far more than floating point for mainframe workloads. The biggest change for the z196 is moving from a dual-issue, in-order replay design to a three-issue, fully out-of-order pipeline.
The out-of-order execution is a little different for the z196 compared to other implementations of CISC instruction sets. Tracking and completion are based on groups, akin to the POWER4-7 or AMD’s K8 and Barcelona, and the groups also determine some later aspects of scheduling and execution. In contrast, Intel’s x86 designs from the P6 up to Sandy Bridge and AMD’s Bulldozer track uops individually. Figure 3 shows the scheduling for the z196, compared to the previous generation z10.
Figure 3. z196 Instruction Scheduling and Comparison
Each cycle, the front-end sends a single group to the out-of-order back-end for execution. Each new group allocates an entry in the Global Completion Table (GCT), which maintains the correct program order, tracks status and eventually retires the group. Each entry can hold 3 uops and there are 24 entries in total, for a window of 72 uops. Conceptually, the GCT is akin to a re-order buffer, but it has the advantage of tracking groups of 3 uops in a single entry. This is more efficient for constructing an out-of-order window, but it also means that groups with only 1 or 2 uops underutilize the available resources.
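The trade-off between entry count and slot utilization is easy to see with a toy model. The sketch below is not IBM's implementation; the class and function names are made up for illustration, and only the 24 entry and 3 uop figures come from the description above.

```python
from collections import deque

GCT_ENTRIES = 24      # entries in the Global Completion Table (from the text)
UOPS_PER_ENTRY = 3    # uop slots per entry (from the text)


class GlobalCompletionTable:
    """Toy model (hypothetical): one entry per group, in program order."""

    def __init__(self):
        self.entries = deque()  # each element is the uop count of one group

    def allocate(self, uops_in_group):
        """Allocate an entry for a group; return False if the table is full."""
        if not 1 <= uops_in_group <= UOPS_PER_ENTRY:
            raise ValueError("a group holds 1 to 3 uops")
        if len(self.entries) == GCT_ENTRIES:
            return False
        self.entries.append(uops_in_group)
        return True

    def uops_in_flight(self):
        return sum(self.entries)

    def slot_utilization(self):
        """Fraction of the 72 uop slots that actually hold a uop."""
        return self.uops_in_flight() / (GCT_ENTRIES * UOPS_PER_ENTRY)


def fill(group_size):
    """Fill an empty table with groups of a fixed size until it is full."""
    gct = GlobalCompletionTable()
    while gct.allocate(group_size):
        pass
    return gct
```

Calling `fill(3)` reaches the full 72 uop window, while `fill(1)` exhausts all 24 entries with only 24 uops in flight, which is the underutilization described above.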
Like nearly every other modern microprocessor, the z196 uses physical register files and renames register tags to avoid moving actual data around. Each uop in a group is renamed onto the available physical register files, in a process that IBM calls mapping. The architectural state is 48 registers composed of 16 integer, 16 floating point and 16 millicode registers. There are two separate physical register files available for mapping the architectural state. There are 80 GPRs, which are 64 bits each and hold the 16 architectural integer registers, the 16 millicode registers and 48 renamed values. Additionally, there is a pool of 48 FPRs (also 64-bit) for the 16 architectural floating point registers and 32 renamed values.
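The register budget can be checked with a few lines of arithmetic, and the mapping step itself can be sketched as a table plus a free list. This is a generic renaming sketch under simple assumptions, not a description of IBM's mapper: the `RenameMap` class is hypothetical, and it never returns registers to the free list, which a real design would do once the overwriting group completes.

```python
from collections import deque

ARCH_GPRS = 16        # architectural integer registers (from the text)
ARCH_FPRS = 16        # architectural floating point registers (from the text)
MILLICODE_REGS = 16   # millicode registers, held in the GPR file (from the text)
GPR_FILE_SIZE = 80    # physical GPRs (from the text)
FPR_FILE_SIZE = 48    # physical FPRs (from the text)

# Physical registers left over for renamed values in each file.
gpr_rename_pool = GPR_FILE_SIZE - ARCH_GPRS - MILLICODE_REGS
fpr_rename_pool = FPR_FILE_SIZE - ARCH_FPRS


class RenameMap:
    """Hypothetical mapper: architectural register -> physical register tag."""

    def __init__(self, arch_regs, phys_regs):
        # Assumption: architectural register i starts out in physical register i.
        self.mapping = list(range(arch_regs))
        self.free = deque(range(arch_regs, phys_regs))

    def rename_dest(self, arch_reg):
        """Give arch_reg a fresh physical register; None means no register is free."""
        if not self.free:
            return None
        phys = self.free.popleft()
        self.mapping[arch_reg] = phys
        return phys

    def lookup(self, arch_reg):
        """Physical register currently holding the newest value of arch_reg."""
        return self.mapping[arch_reg]
```

With these numbers, `gpr_rename_pool` works out to 48 and `fpr_rename_pool` to 32, matching the figures quoted above.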
Unlike most other ISAs, there are no vector instructions in zArch, nor vector registers; everything is based on 64-bit data values. Since the 1960’s, IBM mainframes have been firmly focused on commercial workloads where vectorization is not as beneficial. In contrast, the supercomputing market was dominated in the 1960’s and 1970’s by Control Data and Cray. The HPC market shifted over time to RISC/UNIX systems and at present is primarily x86/Linux.
There are two separate and asymmetric schedulers, which IBM calls issue queues, at the heart of the z196 core. The first is dedicated to uop groups corresponding to integer instructions only. The second can handle groups with integer, binary or decimal floating point instructions. Groups are sent to a scheduler as a single entity and cannot be split up, which again simplifies the design of the core. Generally, groups alternate between the integer and FP scheduler to smooth out utilization, but any group containing FP uops obviously has to go to the FP scheduler.
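One way to picture the steering is a simple toggle for integer-only groups with an override for FP groups. The exact policy is not documented here, so the sketch below is a guess: the function name is invented, and whether an FP group affects the alternation is an assumption (in this version it does not).

```python
def steer_groups(groups_have_fp):
    """Assign each group to the 'INT' or 'FP' scheduler.

    groups_have_fp: iterable of booleans, True if the group contains FP uops.
    Assumption: integer-only groups strictly alternate between the two
    schedulers, and FP groups do not disturb that alternation.
    """
    next_for_integer_group = "INT"
    assignments = []
    for has_fp in groups_have_fp:
        if has_fp:
            # FP groups can only be handled by the second scheduler.
            assignments.append("FP")
        else:
            assignments.append(next_for_integer_group)
            # Flip the toggle so the next integer-only group goes to the other half.
            next_for_integer_group = "FP" if next_for_integer_group == "INT" else "INT"
    return assignments
```

The point of the sketch is only that a whole group moves as a unit: the decision is made once per group, never per uop.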
Each scheduler contains 20 entries. Unlike many RISC designs, the z196 (and the z10) are firmly optimized for load-op instructions that implicitly reference memory. Each of the 40 scheduler entries can have a memory operation in flight, so there is no penalty for register-memory instructions. Including operations in various stages of execution, there can be a total of 56 memory operations in the pipeline.
Once the groups have been renamed and the uops are in the schedulers, they can freely execute out-of-order. Dependency and age matrices in the schedulers keep track of whether a given uop is ready, and the oldest ready uops are issued first, up to 5 per cycle across both schedulers. The integer scheduler can issue 2 uops/cycle for execution to a set of dedicated execution units. The FP scheduler also has dedicated execution pipelines and can issue 3 uops/cycle. The scheduling portion of the pipeline takes 6 cycles.
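Oldest-ready-first selection can be sketched in a few lines. The model below is a simplification with invented names: sets of producer ages stand in for the dependency matrix, every uop is assumed to produce its result in a single cycle, and execution unit conflicts are ignored. Only the issue widths of 2 and 3 come from the description above.

```python
from dataclasses import dataclass, field

INT_ISSUE_WIDTH = 2   # uops/cycle from the integer scheduler (from the text)
FP_ISSUE_WIDTH = 3    # uops/cycle from the FP scheduler (from the text)


@dataclass
class Uop:
    age: int                                       # smaller means older in program order
    waiting_on: set = field(default_factory=set)   # ages of unfinished producers


def select_oldest_ready(queue, width):
    """Pick up to `width` uops that have no outstanding dependencies, oldest first."""
    ready = [u for u in queue if not u.waiting_on]
    ready.sort(key=lambda u: u.age)
    return ready[:width]


def issue_cycle(int_queue, fp_queue):
    """Issue from both schedulers for one cycle and wake up dependents.

    Assumption: every issued uop completes in one cycle, so its consumers
    become ready for the next call.
    """
    issued = (select_oldest_ready(int_queue, INT_ISSUE_WIDTH)
              + select_oldest_ready(fp_queue, FP_ISSUE_WIDTH))
    issued_ages = {u.age for u in issued}
    for queue in (int_queue, fp_queue):
        # Remove issued uops from the queue in place.
        queue[:] = [u for u in queue if u.age not in issued_ages]
        # Clear satisfied dependencies for the uops that remain.
        for u in queue:
            u.waiting_on -= issued_ages
    return sorted(issued_ages)
```

In this model a younger ready uop can issue ahead of an older one that is still waiting on a producer, which is exactly what distinguishes the z196 from the in-order z10.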
In comparison, the z10 receives decoded instructions and places them into an 8 entry Instruction Queue and Address Queue. Instructions in the queues are examined in-order by the grouping logic for scheduling hazards such as address generation interlocks and register dependencies. Once there are no known stall conditions, the group is synchronously sent down one of the two execution pipelines. If the group encounters any stalls during execution (e.g. cache misses), it is replayed back to the queues for re-issue.
In the z196, when uops have executed successfully, they notify the GCT. Once all the uops in a group have finished, the group can complete. The completion stage takes 4 cycles and includes checking for exceptions, writing back the results, updating architectural state and clearing any buffer entries associated with the group. In contrast, the in-order z10 has a 3 cycle completion stage that detects exceptions and updates the architectural state by writing results back to the register files and store buffer.
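Group-based completion on the z196 can be sketched as a queue that only ever retires from its head. The names below are hypothetical, and the sketch does not model the 4 cycle completion latency or how many groups can complete per cycle, since neither the mechanism nor the rate is spelled out here.

```python
from collections import deque


class Group:
    """Hypothetical GCT entry: counts uops that have not yet finished."""

    def __init__(self, group_id, n_uops):
        self.group_id = group_id
        self.pending = n_uops

    def uop_finished(self):
        """Called when one of the group's uops reports successful execution."""
        self.pending -= 1


def complete_ready_groups(gct):
    """Retire groups from the head of the table while all of their uops are done.

    Completion is in program order: a finished younger group waits behind
    an older group that still has uops outstanding.
    """
    completed = []
    while gct and gct[0].pending == 0:
        completed.append(gct.popleft().group_id)
    return completed
```

If a one-uop group finishes before the three-uop group ahead of it, nothing retires until the older group's last uop reports in, at which point both complete in order.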
One unique feature of both the z10 and z196 is that checkpointing is incorporated into the microprocessor pipeline. In both designs, a completed instruction passes through 5 further stages in the checkpoint and recovery unit. The checkpoint unit generates ECC for completed groups and determines whether any errors have occurred. The group and any architectural changes are written into an ECC-protected checkpoint array. When an error occurs, the checkpoint array is used to restart execution, which will resolve transient problems. If the problem is a hard failure, then the known good state in the checkpoint array is migrated to a new processor.
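The recovery flow amounts to keeping a known good copy of the architectural state and falling back to it on an error. The sketch below is purely conceptual: the class and its methods are invented, a plain dictionary stands in for the ECC-protected checkpoint array, and error detection is reduced to a flag passed in by the caller.

```python
class CheckpointedCore:
    """Conceptual model (hypothetical) of checkpoint, restart and migration."""

    def __init__(self, state):
        self.state = dict(state)        # working architectural state
        self.checkpoint = dict(state)   # stand-in for the ECC-protected array

    def complete_group(self, updates, error_detected=False):
        """Apply a group's architectural updates, then commit or roll back.

        Returns True if the group was checkpointed, False if execution
        must restart from the last known good state.
        """
        self.state.update(updates)
        if error_detected:
            # Discard the suspect results and restart from the checkpoint.
            self.state = dict(self.checkpoint)
            return False
        # No error: the new state becomes the known good state.
        self.checkpoint = dict(self.state)
        return True

    def migrate(self):
        """Hard failure: start a spare processor from the known good state."""
        return CheckpointedCore(self.checkpoint)
```

A transient error is handled by calling `complete_group` again after the rollback, while a persistent one would lead to `migrate`, mirroring the two cases described above.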