The front-end and memory pipelines of the z196 are clearly derived from the z10, although there are a number of enhancements. However, the scheduling, register files and execution units are radically different, since the z196 is out-of-order and has a wider issue width.
Both the z196 and z10 have fairly long pipelines to achieve their high target frequencies. The basic z10 pipeline is 29 stages, and the z196's is a little longer; out-of-order execution adds extra stages for renaming, issuing and completing instructions. While the total length grew, many of the performance-critical loops in the pipeline stayed the same number of stages and decreased in absolute latency due to the higher frequency.
With such a long pipeline, the branch predictors must be very accurate to avoid incurring substantial performance penalties. However, massive branch prediction resources are themselves power hungry and can increase prediction latency, reducing throughput. The minimum branch misprediction penalty for the z10 is 13 cycles. The z196 has the additional complexity of register renaming, out-of-order execution and wider superscalar execution, and its minimum misprediction penalty increased substantially to 19 cycles. As a result, the number of instructions in the shadow of a branch on the z196 is dramatically larger than for the z10, due to both the higher penalty (19 vs. 13 cycles) and the wider issue width (3 vs. 2 instructions). This highlights the importance of accurate branch prediction, which is the start of the front-end as shown in Figure 1.
Figure 1. z196 and z10 Instruction Fetch Comparison
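The misprediction shadow described above is simple back-of-the-envelope arithmetic: the penalty in cycles multiplied by the issue width bounds how many instruction slots are wasted per mispredict. A minimal sketch (the function name is illustrative, not IBM terminology):

```python
# Approximate instruction slots lost in the shadow of a mispredicted
# branch: minimum penalty (cycles) times issue width (instructions/cycle).
def branch_shadow(penalty_cycles, issue_width):
    return penalty_cycles * issue_width

z10_shadow = branch_shadow(13, 2)    # 26 instruction slots
z196_shadow = branch_shadow(19, 3)   # 57 instruction slots
```

Roughly twice as many slots are at risk on the z196, which is why IBM invested so heavily in prediction accuracy.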
The z10 branch prediction uses 5 stages in the front-end. It starts by indexing into the 10K entry, 5-way associative branch target buffer (BTB) using the instruction address. BTB entries contain the target, branch type and an 8-state history to predict the direction. Three of the ways can hold 1MB branch target offsets, while two can contain 1GB offsets. Poorly predicted branches are filtered out of the BTB (using one of the 8 states) and held in a separate 512 entry pattern history table (PHT). The z10 also introduced an indirect branch predictor for branches with multiple targets: the 2K entry MTBTB, which is accessed in parallel with the BTB. Both the PHT and MTBTB are indexed using a hash of the instruction address and the global taken-branch history.
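The exact hash IBM uses for the PHT and MTBTB is not public, but the general technique of combining the instruction address with a global history register is well known (gshare-style XOR indexing). A hypothetical sketch, sized for the z10's 512 entry PHT:

```python
# Hypothetical gshare-style index into a 512 entry PHT: XOR the
# instruction address with a global taken-branch history register.
# The XOR hash is an assumption; IBM's actual hash is not disclosed.
PHT_BITS = 9                          # 2**9 = 512 entries

def pht_index(instr_addr, global_history):
    mask = (1 << PHT_BITS) - 1
    # drop the low bit (z instructions are halfword aligned),
    # then fold in the branch history
    return ((instr_addr >> 1) ^ global_history) & mask

def update_history(global_history, taken):
    # shift each resolved branch outcome into the history register
    return ((global_history << 1) | (1 if taken else 0)) & ((1 << PHT_BITS) - 1)
```

Hashing in the history lets the same static branch map to different entries depending on the path taken to reach it, which is what makes pattern-based prediction effective.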
The basic prediction throughput is one branch every four cycles, but IBM employed a number of caching techniques to improve this. If the prediction hits in the most recently used way of the BTB, then a cycle can be eliminated, increasing the throughput to one branch every third cycle. Additionally, a 4 entry branch target queue (TQ) caches the most recently taken branches and can make a prediction every other cycle.
The z196 branch predictors are designed for lower latency, because the front-end has to feed a wider and more aggressively scheduled core. The BTB shrank to 8K entries and 4 ways for faster access times, although the entries themselves hold slightly more information, which improves prediction accuracy. There is also a second-level direction predictor, a massive 32K entry table. The PHT is a newer design that is tagged with additional pattern history information; the array itself is substantially larger, at 4K entries, and the filtering logic has been modified. The MTBTB and the TQ are the same size as in the z10.
The z196 is much more aggressive and can have more speculative instructions in-flight. To account for this, the branch direction predictors can be updated speculatively. The speculative side-copy is tracked until the branch completes, at which point the direction is known. Additionally, there is a new address-mode prediction, and the BTB can limit sequential prefetching to avoid wasting bandwidth.
In both the z196 and z10, the branch prediction is decoupled from the actual instruction fetch. This approach can hide the latency of predicting a branch and also the latency associated with redirecting a taken branch to the target address. It also means that the front-end can prefetch past an instruction cache miss.
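The decoupling described above is typically implemented with a queue of predicted fetch addresses sitting between the predictor and the fetch stage. A conceptual sketch (class and method names are illustrative, not IBM's):

```python
# Conceptual model of a decoupled front-end: the branch predictor runs
# ahead, pushing predicted fetch addresses into a queue; the fetch stage
# drains the queue independently, so a stall on one side does not
# immediately stall the other.
from collections import deque

class DecoupledFrontEnd:
    def __init__(self, predict_next):
        self.predict_next = predict_next   # stands in for the branch predictors
        self.fetch_queue = deque()
        self.pred_addr = 0

    def predict_cycle(self):
        # the predictor generates fetch addresses ahead of actual fetch
        self.fetch_queue.append(self.pred_addr)
        self.pred_addr = self.predict_next(self.pred_addr)

    def fetch_cycle(self):
        # fetch consumes queued addresses (e.g. one 32B access per cycle)
        if self.fetch_queue:
            return self.fetch_queue.popleft()
        return None                        # queue empty, fetch idles
```

Because the predictor can keep filling the queue while fetch waits on an instruction cache miss, the front-end effectively prefetches past the miss, as noted above.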
Once the instruction address has been determined, the front-end retrieves the next set of instructions over 5 cycles on both the z10 and z196. The ITLB and L1 instruction cache are organized similarly to the previous generation z10. The L1 ITLB is 2-way associative with 128 entries. However, each z196 L1 ITLB entry can map either a 4KB or a 1MB page, whereas the z10 was restricted to 4KB pages. The branch predictors have also been enhanced to track whether the branch target is in a 4KB or 1MB page.
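A TLB entry that can map either page size must record the size and perform its tag comparison at the matching granularity. A simplified model (a linear scan rather than the real 2-way set-associative organization, and all names are illustrative):

```python
# Hypothetical model of a TLB whose entries map either 4KB or 1MB pages,
# as described for the z196 L1 ITLB. Simplified to a list scan; the real
# structure is 2-way set associative with 128 entries.
PAGE_4K = 4 * 1024
PAGE_1M = 1024 * 1024

class ITLBEntry:
    def __init__(self, virt_base, phys_base, page_size):
        assert page_size in (PAGE_4K, PAGE_1M)
        self.virt_base = virt_base
        self.phys_base = phys_base
        self.page_size = page_size

def itlb_lookup(entries, vaddr):
    for e in entries:
        # compare tags at the granularity recorded in the entry
        if vaddr & ~(e.page_size - 1) == e.virt_base:
            # hit: splice the page offset onto the physical base
            return e.phys_base | (vaddr & (e.page_size - 1))
    return None  # ITLB miss
```

The benefit of 1MB entries is reach: a single entry covers 256 times as much code as a 4KB entry, reducing misses for large working sets.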
The L1 instruction cache is 64KB and 4-way associative with set prediction to reduce latency. Each cache line is 256 bytes, but the front-end only fetches 32 bytes per cycle. While larger fetches are possible, the core is only 3-wide and instructions are 2, 4 or 6 bytes long, so even three of the longest instructions consume only 18B per cycle; 32B/cycle is more than sufficient to keep the pipeline busy.
The L1I does not contain any pre-decode information. Once the 32B of instructions have been fetched, they are placed into an instruction buffer that is responsible for pre-decoding. The buffer has three entries, which IBM refers to as super basic block buffers (SBBBs). Each entry is 80B in the z196, up from 64B in the z10. The SBBBs mark instruction boundaries, count the corresponding uops and check for branches. Once pre-decoding is finished, the z196 passes the uops along to the decode stage of the pipeline.
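Finding instruction boundaries is cheap in z/Architecture because the two high-order bits of the first opcode byte encode the instruction length: 00 means 2 bytes, 01 or 10 means 4 bytes, and 11 means 6 bytes. A sketch of boundary marking on a fetch group (the list-of-offsets bookkeeping simplifies the real SBBB marker bits):

```python
# z/Architecture encodes instruction length in the top two bits of the
# first opcode byte: 00 -> 2 bytes, 01/10 -> 4 bytes, 11 -> 6 bytes.
# Pre-decode can therefore mark boundaries without fully decoding.
def instruction_length(opcode_byte):
    top = opcode_byte >> 6
    if top == 0b00:
        return 2
    if top == 0b11:
        return 6
    return 4                 # 01 or 10

def mark_boundaries(fetch_bytes):
    offsets, pos = [], 0
    while pos < len(fetch_bytes):
        offsets.append(pos)  # an instruction starts here
        pos += instruction_length(fetch_bytes[pos])
    return offsets
```

For example, a group starting with BCR (opcode 0x07), BC (0x47) and a 6-byte RIL-format instruction (0xC0) yields boundaries at offsets 0, 2 and 6.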