Once instructions are issued from the queues, they proceed to a decode stage. It is unclear exactly what happens in this second decode stage, but it is likely that the first decode stage only determines the instruction type to disperse to the correct queue. Presumably, the second decode stage selects the exact operation and the input and output operands.
In the next stage, register inputs are read from the massive physical register files, which are duplicated – one per thread. The size of Poulson’s branch and predicate register files is unchanged from its predecessors: there are 8 branch registers and 64 predicate registers, of which the first 16 are fixed while the upper 48 may rotate. The 128-entry floating point register file has also been left largely intact. However, Poulson adds 32 entries to the integer register file. Since only 128 integer registers are visible to a single procedure, the extra registers reduce spilling of the register stack to memory (note also that there are 16 shadow registers for fast interrupt handling). Register spills force the Register Stack Engine to write registers back to the data cache, consuming valuable bandwidth and slowing execution. Once the inputs for an instruction have been read from the register file or bypass network, the instruction is actually executed.
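To see why the 32 extra integer registers matter, consider a toy model of the Register Stack Engine: each procedure call allocates a register frame, and once the physical file overflows, the oldest frames must be spilled to the backing store in memory. This is a deliberately simplified sketch (real stacked-register allocation is more involved); the file sizes are the ones from the article.

```python
# Toy model of Register Stack Engine (RSE) spilling. Each call allocates
# a frame of stacked registers; when the physical file overflows, the
# oldest registers are written back ("spilled") to the data cache.
# Simplified illustration -- not Intel's actual allocation algorithm.

def count_spills(phys_regs, frame_sizes):
    """Return how many registers the RSE must spill to memory when the
    given sequence of call-frame allocations exceeds phys_regs."""
    in_use = 0
    spilled = 0
    for size in frame_sizes:      # each nested call allocates a frame
        in_use += size
        if in_use > phys_regs:
            spilled += in_use - phys_regs   # oldest regs go to memory
            in_use = phys_regs
    return spilled

calls = [48] * 4  # four nested calls, 48 stacked registers each
print(count_spills(128, calls))  # 64 spills with a 128-entry file
print(count_spills(160, calls))  # 32 spills after Poulson's 32 extra entries
```

The larger file halves the spill traffic in this (hypothetical) call pattern, which is exactly the bandwidth-saving effect the article describes.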
Figure 5 – Poulson Execution Units and Comparison
Each of the instruction queues can issue 2 instructions per cycle to the execution units, except the branch queue, which can sustain 3 branches per cycle. Altogether, Poulson can issue and execute 11 instructions, elide NOPs, and retire 4 bundles per cycle. However, the balance of instruction queues and execution units is substantially different from Tukwila and earlier generations, and shows Itanium’s emphasis shifting from floating point to server workloads. Issuing 11 instructions seems like overkill for a design that can only fetch 6, but there are benefits, particularly for handling cache misses. When a load goes to main memory, it will certainly stall the thread’s progress; when the miss returns, issuing 11 instructions helps to quickly clear out any instructions waiting to be replayed.
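The fetch/issue asymmetry can be made concrete with a little arithmetic: while a load miss is outstanding, instructions pile up in the queues, and only an issue width greater than the fetch width lets the machine drain that backlog. The sketch below is an illustrative steady-state model, not a cycle-accurate one.

```python
# Sketch: why an 11-wide issue helps after a cache miss even though
# fetch is only 6-wide. Instructions accumulate in the queues during the
# stall; the backlog shrinks by (issue - fetch) instructions per cycle.
# Hypothetical steady-state model for illustration only.

def cycles_to_drain(backlog, fetch_width, issue_width):
    """Cycles to empty a backlog while fetch keeps supplying new work."""
    net = issue_width - fetch_width
    if net <= 0:
        return None  # the backlog never shrinks
    cycles = 0
    while backlog > 0:
        backlog -= net
        cycles += 1
    return cycles

# Say 40 instructions queued up during a memory-latency stall:
print(cycles_to_drain(40, 6, 11))  # 8 cycles at a net 5 per cycle
print(cycles_to_drain(40, 6, 6))   # None: a 6-wide issue never catches up
```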
Poulson has 3 branch units, 2 simple ALUs, 2 integer units, 2 FPUs, and 2 memory pipelines. Tukwila had 3 branch units and 2 FPUs, but no dedicated pipelines for simple ALU instructions, which could execute on any of its 4 memory pipelines or 2 integer units. While Poulson’s FPU latency is unknown, most integer operations have single-cycle latency for dependent operations. In addition, there is a new 4-cycle, 64-bit integer multiplier on at least one of the two integer pipelines, used for both multiply and multiply-add instructions.
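Those latency figures determine how long a dependent chain of integer operations takes. A minimal sketch, using only the numbers quoted above (1-cycle simple ALU ops, 4-cycle multiply and multiply-add); the operation names are illustrative placeholders, not actual Itanium mnemonics:

```python
# Latency model for a fully dependent chain of integer operations,
# using the article's figures: 1 cycle for simple ALU ops, 4 cycles
# for the new 64-bit multiplier. Illustrative sketch only.

LATENCY = {"add": 1, "and": 1, "mul": 4, "madd": 4}

def chain_latency(ops):
    """Total cycles for a chain where each op depends on the previous."""
    return sum(LATENCY[op] for op in ops)

print(chain_latency(["mul", "add", "add"]))  # 4 + 1 + 1 = 6 cycles
```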
One of the hallmarks of McKinley and its derivatives was high performance on the DAXPY kernel – sustaining two FP load-pairs, two FMAs, and two FP stores per cycle. This is critical for many workloads that rely on linear algebra, and requires 4 memory units. However, the high performance computing and workstation market – where this matters – is almost entirely x86 at this point. So there is little point in tuning Poulson for DAXPY, and hence the architects converted two memory pipelines into simple ALUs.
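For reference, DAXPY is the BLAS routine y = a·x + y. Each iteration needs two FP loads, one fused multiply-add, and one store; unrolled by two, a core that sustains two load-pairs, two FMAs, and two stores per cycle completes two iterations per cycle, which is why McKinley needed its 4 memory ports. A scalar reference version:

```python
# DAXPY (y = a*x + y), the BLAS kernel McKinley was tuned for.
# Scalar reference implementation; a tuned version would be unrolled
# and vectorized, but the per-element work is identical.

def daxpy(a, x, y):
    """Return a*x + y elementwise: 2 loads, 1 FMA, 1 store per element."""
    return [a * xi + yi for xi, yi in zip(x, y)]

print(daxpy(2.0, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # [6.0, 9.0, 12.0]
```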
Poulson’s execution units bring us to the question of multithreading. Each thread has its own set of instruction queues, but how do the two threads share the execution units and memory pipelines? Intel did little to answer this question.
Since the instruction buffers are replicated structures, the simplest approach is fine-grained multithreading. In a given cycle, one of the two instruction buffers would be able to issue up to 11 instructions to the execution units. As the NOPs have been removed earlier, this is a fairly straightforward and reasonable design choice.

A higher performance option is true simultaneous multithreading (SMT) – issuing from both threads to take full advantage of the 11-wide back-end. This would keep the instruction queues from both threads active every cycle, and increase power consumption a bit. The tricky part would be selecting which instructions from each thread to issue to the execution units. A straightforward technique would be to give one thread priority, and then let the second thread take advantage of any unused pipelines. Alternatively, the pipelines could be split evenly between the two threads, but that seems more likely to reduce single-threaded performance.
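The priority-thread policy sketched above is easy to express: the favored thread issues first, and the other thread fills whatever slots remain. This is purely speculative, since Intel has not documented Poulson’s actual selection logic.

```python
# Sketch of a "priority thread" SMT issue policy: thread 0 issues
# first each cycle, and thread 1 backfills any unused execution slots.
# Speculative illustration -- Poulson's real policy is undocumented.

def select_issue(thread0_ready, thread1_ready, width=11):
    """Pick up to `width` ready instructions, favoring thread 0."""
    issued = thread0_ready[:width]
    slots_left = width - len(issued)
    issued += thread1_ready[:slots_left]    # backfill from thread 1
    return issued

# Thread 0 has 7 ready instructions, thread 1 has 6; 11 slots total:
picked = select_issue(list(range(7)), list(range(100, 106)))
print(len(picked))  # 11: all 7 from thread 0, first 4 from thread 1
```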
Regardless of the multithreading approach, once instructions have executed, there are three more pipeline stages: one for detecting replays and exceptions, then two cycles to write back and retire results. While scheduling and execution are done at the instruction level, retirement is still done at bundle granularity. Instructions that have completed will wait in the instruction queues until the whole bundle can retire.
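The instruction-level execution versus bundle-level retirement split can be sketched in a few lines: completed instructions sit in the queues, and a bundle retires only when every instruction in it has finished, in order. An illustrative model, not Intel's actual retirement logic:

```python
# Sketch of bundle-granularity retirement: instructions complete
# individually, but a 3-instruction bundle only retires once all of
# its instructions are done, and retirement proceeds in order.
# Illustrative model only.

def retirable_bundles(done_flags, bundle_size=3):
    """Count leading bundles whose instructions have all completed."""
    retired = 0
    for i in range(0, len(done_flags), bundle_size):
        bundle = done_flags[i:i + bundle_size]
        if all(bundle):
            retired += 1
        else:
            break  # in-order retirement stalls at the first incomplete bundle
    return retired

# First bundle fully done; second still has one instruction in flight:
print(retirable_bundles([True, True, True, True, False, True]))  # 1
```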