Integer Execution Units
For actual execution, the two microarchitectures are very different, as shown in Figure 4. The in-order, 2-issue z10 was very carefully designed to minimize latencies without impacting frequency, because stalls are so pernicious in an in-order pipeline. The out-of-order z196 has considerably more execution resources and greater leeway in handling latencies, because they can be effectively hidden.
The z196 schedulers are split into two halves, and this separation continues throughout the 5 pipelines and 6 execution units. The integer scheduler has two dedicated pipelines and execution units: a memory pipeline and an ALU. The FP scheduler has three pipelines: a memory pipeline, an ALU and a third that feeds the binary and decimal FPUs. When uops are issued from the scheduler, they first go to the physical register files to access the input operands and then proceed to the execution pipelines.
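As a quick sanity check on those counts, the split can be written down as a small table. This is only a sketch of the description above: the dictionary layout and the unit labels (such as "memory unit" and "BFU") are illustrative names, not IBM terminology.

```python
# Pipelines per scheduler half, as described in the text.
# Keys and unit labels are hypothetical names chosen for this sketch.
Z196_PIPELINES = {
    "integer": {
        "memory": ["memory unit"],
        "alu": ["ALU"],
    },
    "fp": {
        "memory": ["memory unit"],
        "alu": ["ALU"],
        "fpu": ["BFU", "DFU"],  # one pipeline feeding the binary and decimal FPUs
    },
}

def count_pipelines(layout):
    """Total number of issue pipelines across both scheduler halves."""
    return sum(len(half) for half in layout.values())

def count_units(layout):
    """Total number of execution units across all pipelines."""
    return sum(len(units) for half in layout.values() for units in half.values())

print(count_pipelines(Z196_PIPELINES), count_units(Z196_PIPELINES))
```

The FP side accounts for the mismatch between pipelines and units: its third pipeline is shared by two execution units.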
In the z196, the ALUs are independent of the other execution units and pipelines. The ALUs are responsible for executing all integer arithmetic uops, and are completely symmetrical. Additionally, they resolve branch instructions to determine the correct target and direction. The pipeline starts with a cycle to read from the register file, followed by four cycles for execution. The ALUs can forward results after the first execution cycle for back-to-back execution in the same unit or for forwarding to the store queue. However, the extremely high frequency takes a toll on other paths. Forwarding from an ALU to the AGUs or to the other ALU incurs a one-cycle penalty, owing to the distances involved.
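The cost of that forwarding penalty is easiest to see on a chain of dependent simple integer uops. The sketch below is a toy model, not a description of the real scheduler: it assumes each uop produces a forwardable result after one execution cycle, that a same-unit consumer can start on the next cycle, and that a consumer on the other ALU or an AGU starts one cycle later. The unit names and function names are hypothetical.

```python
def forward_latency(producer, consumer):
    """Cycles from a producer starting execution to a dependent consumer starting.

    Assumption: 1 cycle when the consumer runs on the same ALU,
    plus a 1-cycle penalty when the result crosses to another unit.
    """
    base = 1
    penalty = 0 if producer == consumer else 1
    return base + penalty

def chain_cycles(units):
    """Cycles until the last uop in a dependent chain has a forwardable result.

    `units` lists the unit each uop runs on, in dependency order.
    """
    if not units:
        return 0
    total = 1  # the final uop still needs one execution cycle of its own
    for producer, consumer in zip(units, units[1:]):
        total += forward_latency(producer, consumer)
    return total

same_unit = chain_cycles(["ALU0", "ALU0", "ALU0"])
ping_pong = chain_cycles(["ALU0", "ALU1", "ALU0"])
print(same_unit, ping_pong)
```

Under these assumptions, a three-uop chain that stays on one ALU finishes in 3 cycles, while the same chain bouncing between the two ALUs takes 5.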
Figure 4. z196 Execution Units and Comparison
The z10 ALUs were tightly coupled to and dependent on the memory pipelines, with automatic forwarding. Instructions started at the same pipeline stage (address generation in the memory pipeline), regardless of whether they accessed memory or not. For the common register-memory instructions this substantially simplified the design, but it did impose a penalty on simple RISC-like instructions. This helped meet the frequency target for the z10, but was no longer necessary with an out-of-order design. Both ALUs can handle simple branches, but more complex branches (e.g. branch on count, which is a cracked instruction) are handled on port 1.
Floating Point Execution Units
The floating point in IBM’s mainframes is particularly interesting because it predates IEEE 754 by two decades. At the time, nearly every computer company had a proprietary FP format, creating immense confusion which eventually spurred the adoption of a common standard.
IBM was no different and started out with a hexadecimal FP format. Eventually, in the 1990s, IEEE 754 compatible binary FP was introduced to the ISA. Decimal FP was subsequently introduced in the z9 processor in 2005 using millicode and hardware assists. The first full hardware implementation was the POWER6. The z10 used a similar DFP unit with some unique features and was released in early 2008.
The z196 floating point is largely the same as the z10. There is a single pipeline with two execution units, a binary FPU and a decimal FPU. Again, most mainframe customers are running business workloads that are not particularly sensitive to floating point performance.
The binary FPU was originally derived from the POWER6 design, with added support for hexadecimal data. It is a fully pipelined unit that is 9 stages deep and can perform a 64-bit multiply-accumulate. However, there are 2 extra pipeline stages to convert from the native zArchitecture data formats to the internal execution format (which was optimized for PowerPC). Only the first data conversion cycle is visible for dependent uops, since the consumer operation can use the execution format as input. For binary FPU instructions, the pipeline is 13 stages deep in both the z10 and z196.
The decimal FPU is a separate unit that is unpipelined and has variable latency. The DFPU is based on a 36-digit adder that takes two cycles to execute and can also act as two 18-digit adders. 34-digit and 16-digit multiplication are easily synthesized from the adder, while division uses a more complex algorithm. Actual latencies vary from 12 to 193 cycles depending on the operation (add, multiply, divide) and operand size (double word or quad word).
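To illustrate how a wide decimal adder can double as two narrower adders and serve as the building block for multiplication, here is a minimal software sketch. It is not the DFPU's actual algorithm: the digit-list representation, the function names and the naive repeated-addition multiply are all assumptions made for illustration, and the hardware certainly uses a more efficient scheme.

```python
WIDTH = 36  # adder width in decimal digits, per the text
HALF = 18   # width of each half in split mode

def to_digits(n, width=WIDTH):
    """Little-endian list of decimal digits, truncated to `width` digits."""
    return [(n // 10**i) % 10 for i in range(width)]

def from_digits(digits):
    """Inverse of to_digits."""
    return sum(d * 10**i for i, d in enumerate(digits))

def add_digits(a, b):
    """Ripple-carry decimal add of two equal-length digit lists."""
    carry, out = 0, []
    for x, y in zip(a, b):
        s = x + y + carry
        out.append(s % 10)
        carry = s // 10
    return out, carry

def dual_add(a, b):
    """Split mode: two independent 18-digit adds, no carry between halves."""
    lo, _ = add_digits(a[:HALF], b[:HALF])
    hi, _ = add_digits(a[HALF:], b[HALF:])
    return lo + hi

def multiply(a, b):
    """Schoolbook multiply built only from the adder (repeated shifted adds)."""
    acc = [0] * WIDTH
    a_digits = to_digits(a)
    for pos, digit in enumerate(to_digits(b)):
        shifted = ([0] * pos + a_digits)[:WIDTH]
        for _ in range(digit):
            acc, _carry = add_digits(acc, shifted)
    return from_digits(acc)

print(multiply(1234567890123456, 9876543210987654))
```

A product of two 16-digit operands has at most 32 digits, so it fits in the 36-digit accumulator without overflow. In split mode, a carry out of the low 18 digits is dropped rather than propagated into the high half, which is what lets the two halves operate independently.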
One other difference between the two designs is that the z196 can execute integer and FP instructions in parallel by virtue of the out-of-order execution. The previous generation z10 could only overlap certain integer operations such as simple branches, load address or simple register loads with an FPU operation.
The last execution unit in the z196 and z10 is a hardware accelerator that is not shown in the diagram. The compression and cryptography coprocessor is shared between a pair of cores and accessed through millicoded instructions. The coprocessor contains two compression pipelines, each with a 16KB, 4-way cache and a 32-entry TLB that are coherent with the rest of the system. The peak performance is 8.8GB/s for expansion and 240MB/s for compression. The cipher pipeline executes DES, 3DES and AES encryption with 290-960MB/s throughput. Additionally, a SHA-2 pipeline calculates digests of up to 512 bits.