The Itanium instruction set was influenced by the RISC philosophy, emphasizing simple instructions and relying on the compiler for complex operations. The ISA is a strict load-store model, specifically designed to avoid any complex instructions that would have to be decoded into multiple uops – unlike x86, zArch and even the ostensibly simple Power and ARM. Itanium also has no microcode, and instead stole a page from Alpha. The firmware uses Processor and System Abstraction Layers (PAL/SAL) to create a standard software interface to the outside world and to handle tasks like booting, power management and machine check error handling. Lack of virtualization was an oversight in the original ISA, but it was later added through hardware and PAL code.
Decoding takes two stages and is where Poulson begins to significantly deviate from Tukwila and resemble a more conventional in-order pipeline. Rather than preserve Itanium’s VLIW semantics, Poulson actually breaks bundles apart into constituent instructions. These individual instructions, instead of bundles, form the basis of further execution.
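To make "breaking bundles apart" concrete, here is a minimal sketch of how a bundle splits into its constituent instructions. It relies on the architected IA-64 bundle layout (128 bits: a 5-bit template in the low bits followed by three 41-bit instruction slots); the function names `split_bundle` and `pack_bundle` are hypothetical, and this says nothing about how Poulson's decoders are actually built.

```python
TEMPLATE_BITS = 5   # low 5 bits of a bundle select the template
SLOT_BITS = 41      # each of the three instruction slots is 41 bits
SLOTS_PER_BUNDLE = 3


def split_bundle(bundle: int):
    """Return (template, [slot0, slot1, slot2]) for a 128-bit bundle value."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots


def pack_bundle(template: int, slots):
    """Inverse of split_bundle, handy for building test inputs."""
    bundle = template & ((1 << TEMPLATE_BITS) - 1)
    for i, slot in enumerate(slots):
        bundle |= (slot & ((1 << SLOT_BITS) - 1)) << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle
```

Two bundles per cycle therefore yield at most six slots, which matches the decode width discussed below.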
Figure 3 – Poulson Decode and Comparison
Previous Itanium designs have a decoupling buffer between fetch and decode that can hold up to 8 bundles (or 128B of instructions) that have been requested or prefetched. The Instruction Bundle Buffer smooths out the flow of bundles, lets the front-end run ahead of execution and prevents stalls in the front-end (e.g. bubbles due to branching) from propagating to the rest of the pipeline. Poulson has a decoupling Instruction Buffer, but it is later in the pipeline and the division between front-end and back-end is fairly different from earlier pipelines. In essence, Poulson places the buffer between decoding and execution, whereas McKinley, Montecito and Tukwila place the buffer between fetching and decoding. Since there does not appear to be a fetch buffer, Poulson’s decoders should use the same multithreading policy as the fetch stage of the pipeline.
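The effect of a decoupling buffer can be illustrated with a toy queue model. This is only a sketch under simplifying assumptions: the function `simulate`, the per-cycle fetch pattern and the drain rate are all hypothetical, and only the 8-bundle capacity comes from the description of the earlier designs above.

```python
def simulate(fetched_per_cycle, capacity, drain=1):
    """Toy model: each cycle the front-end adds bundles, then the consumer drains.

    Returns (bundles_delivered, starved_cycles), where a starved cycle is one
    in which the consumer wanted `drain` bundles but the buffer held fewer.
    """
    occupancy = 0
    delivered = 0
    starved_cycles = 0
    for fetched in fetched_per_cycle:
        # The front-end can only deposit what fits in the remaining space.
        occupancy += min(fetched, capacity - occupancy)
        taken = min(drain, occupancy)
        occupancy -= taken
        delivered += taken
        if taken < drain:
            starved_cycles += 1
    return delivered, starved_cycles


# Hypothetical pattern: three good fetch cycles, a two-cycle bubble, then recovery.
pattern = [2, 2, 2, 0, 0, 2]
deep = simulate(pattern, capacity=8)     # 8-bundle buffer
shallow = simulate(pattern, capacity=1)  # almost no buffering
```

With the 8-bundle buffer the consumer never starves in this pattern, because the front-end ran ahead before the bubble; with a single entry, both bubble cycles reach the consumer.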
In the first decode stage, Poulson decodes up to two bundles into as many as 6 instructions. Itanium instructions come in several varieties: branches (B), memory (M), simple integer (A), complex integer (I) and floating point (F). The decoding on prior designs was intrinsically tied to instruction issue (called dispersion) and the execution units. If there were insufficient execution resources for the instructions in two bundles to issue (e.g. 4 branches, but only 3 branch units), then only the first bundle would be decoded, and the second would stall until the next cycle – effectively halving the IPC. The first Itanium, Merced, only had two memory units, which was a tremendous problem for performance. One of the largest improvements in McKinley was adding 2 memory pipelines, which enabled far more bundles to dual issue. Poulson is free from such limitations and can decode bundles without regard for the execution units. Instead, the limits on decoding come from the write ports on the instruction buffers, which are far cheaper than execution units. Additionally, Poulson has a new 64-bit integer multiply instruction.
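The dual-issue restriction on the earlier designs can be sketched as a simple resource check. This is a deliberately simplified model, not the real dispersal rules: the functions `fits` and `bundles_issued` are hypothetical, the unit counts are illustrative inputs, the first bundle is assumed to always fit on its own, and A-type instructions are assumed to take any spare I or M unit.

```python
from collections import Counter


def fits(instr_types, units):
    """True if every instruction can be given its own execution unit."""
    need = Counter(instr_types)
    for kind in ("B", "M", "I", "F"):
        if need[kind] > units.get(kind, 0):
            return False
    # Assumption: simple integer (A) ops use whatever I or M units are left over.
    spare = (units.get("I", 0) - need["I"]) + (units.get("M", 0) - need["M"])
    return need["A"] <= spare


def bundles_issued(first, second, units):
    """Return 2 if both bundles can issue together, else 1 (second bundle stalls)."""
    if fits(list(first) + list(second), units):
        return 2
    return 1


# Illustrative unit counts only: 4 memory, 2 integer, 2 FP, 3 branch.
wide = {"M": 4, "I": 2, "F": 2, "B": 3}
two_mem = {"M": 2, "I": 2, "F": 2, "B": 3}

branchy = bundles_issued("MBB", "MBB", wide)        # 4 branches, 3 branch units
mem_heavy_old = bundles_issued("MMI", "MMI", two_mem)
mem_heavy_new = bundles_issued("MMI", "MMI", wide)
```

In this model the branch-heavy pair and the memory-heavy pair on two memory units both issue only one bundle, while adding two memory units lets the memory-heavy pair dual issue. A Poulson-style decoder would skip this check entirely and be limited only by buffer write ports.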
Ironically, while Itanium avoided out-of-order execution, the architecture still relies on register renaming. Each function also allocates a variable number of virtual stack registers that are mapped to physical registers. Any number of stack registers can be used to pass parameters between functions – the hardware simply sets the right virtual to physical register mapping. Additionally, a subset of the integer, floating point and predicate registers can rotate for loop pipelining. In essence, rotation renames the virtual registers used for each iteration of a loop, so that multiple iterations can safely execute in parallel.
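The parameter-passing trick can be sketched as a base-plus-offset mapping. This is a minimal model under stated assumptions: the helper names, the 96-entry physical stacked file and the wrap-around are illustrative (real implementations may have more physical registers and must spill to memory on overflow), and the returned index is relative to the stacked portion of the register file.

```python
FIRST_STACKED = 32       # r32 is the first stacked general register
NUM_PHYS_STACKED = 96    # assumption: physical stacked file size for this sketch


def stacked_to_physical(vreg, bof):
    """Map virtual register r<vreg> to a physical stacked index.

    `bof` is the bottom-of-frame: the physical index backing r32 for the
    currently executing function.
    """
    if vreg < FIRST_STACKED:
        raise ValueError("r0-r31 are fixed and not stacked")
    return (bof + vreg - FIRST_STACKED) % NUM_PHYS_STACKED


def callee_bof(caller_bof, caller_locals):
    """On a call, the callee's frame starts at the caller's first output register."""
    return (caller_bof + caller_locals) % NUM_PHYS_STACKED


# Caller has 5 locals (r32-r36), so its first output register is r37.
caller_out0 = stacked_to_physical(37, bof=0)
# After the call, the same physical register is the callee's r32.
callee_in0 = stacked_to_physical(32, bof=callee_bof(0, caller_locals=5))
```

No data moves on the call: the caller's output register and the callee's first input register resolve to the same physical entry, which is the "right virtual to physical register mapping" described above.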
Architecturally, there are 128 general purpose and 128 floating point registers, plus the previously mentioned 64 predicate and 8 branch address registers. The first 32 integer and floating point registers are fixed and global. The upper 96 integer registers are stacked and can optionally be rotated. The upper 96 floating point registers rotate, as do the upper 48 predicate registers. The bottom line is that stacked and rotating registers mean that Itanium must rename the integer, floating point and predicate registers. Since the renaming is limited, this is far easier and less resource intensive than in most out-of-order designs.
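Rotation can be sketched as adding a rotating register base (rrb) to the register number, modulo the size of the rotating region. This is a simplified model that ignores stacking: the function name `rotate` and the `size` parameter are hypothetical, the region bounds come from the register counts above, and the sketch assumes the base is decremented once per loop iteration.

```python
# (first rotating register, maximum size of the rotating region)
ROTATING = {
    "gr": (32, 96),  # upper 96 general registers
    "fr": (32, 96),  # upper 96 floating point registers
    "pr": (16, 48),  # upper 48 predicate registers
}


def rotate(kind, vreg, rrb, size=None):
    """Return the register actually accessed when code names `vreg`.

    Registers outside the rotating region are returned unchanged.
    """
    first, max_size = ROTATING[kind]
    size = max_size if size is None else size
    if vreg < first or vreg >= first + size:
        return vreg
    # Python's % always yields a non-negative result, so negative rrb wraps.
    return first + (vreg - first + rrb) % size


# Iteration 0 (rrb = 0) writes r32; iteration 1 (rrb = -1) reads it as r33.
written = rotate("gr", 32, rrb=0)
read_next_iter = rotate("gr", 33, rrb=-1)
```

Because each iteration sees its values under a shifted name, a value produced as r32 in one iteration is consumed as r33 in the next without any copy, and overlapping iterations do not overwrite each other's registers.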
Poulson’s second decoding stage handles register renaming, mapping the architectural registers through stacking and rotation onto the underlying physical registers. Previously renaming was done in the back-end; instructions were first dispersed and then renamed, right before being executed. Poulson reverses the order – renaming before the decoupling buffer and dispersal, to start register file reads earlier in the pipeline. This reflects the more dynamic nature of the design, and a step away from its rigidly static VLIW predecessors.
This is also the last of the 4 pipeline stages in the front-end. Once the instructions are decoded and renamed, they are put into the Instruction Buffers based on the instruction type and proceed down the back-end for execution.