Tukwila and all earlier Itanium designs were VLIW microarchitectures: compiled bundles formed the basis of execution and instructions were statically scheduled. Dependencies were resolved by global stalls, which halted the entire pipeline until the hazard had cleared.
Poulson is fundamentally different and much more akin to traditional RISC or CISC microprocessors. Instructions, rather than explicitly parallel bundles, are dynamically scheduled and executed. Dependencies are resolved by flushing bad results and replaying instructions; no more global stalls. There is even a minimal degree of out-of-order execution – a profound repudiation of some of the underlying assumptions behind Itanium.
At Poulson’s heart are the distributed Instruction Buffers, which are replicated – one for each thread. The Instruction Buffers are key to Poulson’s dynamic scheduling, tracking both bundles and instructions. Duplicating the IB also enables more sophisticated multithreading. As instructions are decoded and renamed, they pass from the front-end into the IB.
Figure 4 – Poulson Instruction Scheduling and Comparison
Each IB can hold 96 instructions from 32 bundles, receive 2 bundles (6 instructions) and issue 12 instructions per cycle. The IB is composed of 7 different queues, which are used for tracking and replaying instructions: two queues for control and five for dispersing actual instructions. This organization separates control from data flow, simplifying the design and yielding a more scalable circuit implementation.
Two bundles per cycle are placed into the 32-entry control queue. The control queue tracks dependencies between bundles and instructions and can retire up to 4 bundles (or 12 instructions) per cycle. A separate 32-entry bundle queue holds information about each bundle, which is used to track branches and flush the pipeline in case of mispredictions. These two control path queues are the only ones that deal with the status and retirement of bundles – the other five are strictly concerned with instruction scheduling.
There is one instruction scheduling queue for each type of instruction (B, A, I, F, and M) and every real instruction in a bundle must be allocated an in-order entry in the appropriate queue. Up to 3 branch instructions can be placed into the 16-entry branch queue every cycle. The branch queue is closely related to and must communicate with the bundle queue, since the two operate together to track branches and mispredictions. However, the branch queue is on the data path for individual instructions, while the bundle queue is a control path that tracks bundles.
The 32-entry ALU queue is used for simple integer (or A-type) instructions, such as addition or logical operations. It can accept 4 instructions per cycle. The vast majority of all integer instructions are simple. More rare and complex integer instructions such as variable shifts, population count and all integer multiplication are classified as I-type. Previously, integer multiplication was actually done in the floating point unit. The 32-entry integer queue similarly receives up to 4 instructions per cycle. All floating point operations (F-type) are placed into the 32-entry FP queue, at a rate of 2 per cycle.
The last and arguably most important instruction type is memory or M-type; the loads and stores that feed the pipeline. Four memory accesses per cycle can go into the 32-entry memory queue. This includes regular 64-bit integer memory accesses, 82-bit floating point accesses (82-bits for precise x86 compatibility) and the floating point load pair instruction which deposits two 32-bit or 64-bit values into an adjacent pair of 82-bit FP registers.
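As a rough illustration, the per-queue capacities and per-cycle accept widths described above can be modeled as a simple dispersal routine. This is a minimal sketch: the numbers come from the text, while the data structures and names (`QUEUE_SPECS`, `disperse`) are invented for clarity and are not Intel's implementation.

```python
from collections import deque

# Capacities and per-cycle accept widths as described in the article;
# the dictionary layout itself is a modeling assumption.
QUEUE_SPECS = {
    "B": {"capacity": 16, "accept_per_cycle": 3},  # branch queue
    "A": {"capacity": 32, "accept_per_cycle": 4},  # simple-integer ALU queue
    "I": {"capacity": 32, "accept_per_cycle": 4},  # complex-integer queue
    "F": {"capacity": 32, "accept_per_cycle": 2},  # floating point queue
    "M": {"capacity": 32, "accept_per_cycle": 4},  # memory queue
}

def disperse(instructions, queues):
    """Place decoded instructions into their type queue, in program order.
    Returns the instructions that could not be accepted this cycle."""
    accepted = {t: 0 for t in QUEUE_SPECS}
    leftover = []
    for inst_type, payload in instructions:
        spec = QUEUE_SPECS[inst_type]
        q = queues[inst_type]
        if accepted[inst_type] < spec["accept_per_cycle"] and len(q) < spec["capacity"]:
            q.append(payload)
            accepted[inst_type] += 1
        else:
            leftover.append((inst_type, payload))
    return leftover
```

For example, presenting five A-type instructions in one cycle would leave the fifth waiting, since the ALU queue accepts at most four per cycle.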
The distributed instruction queues are substantially more flexible than earlier statically scheduled, stalling designs. As mentioned above, if there were not enough execution units for a bundle to execute, then Tukwila would stall at the decode phase. So the number of execution units in Tukwila was designed around the worst case mix of instructions that could be delivered. In contrast, only Poulson's instruction buffer queues need to be sized for the worst case mix of instructions; the execution units can be chosen based on average program behavior. The instruction queues are equipped with enough write ports so that Poulson can decode any combination of two bundles that was viable on Tukwila. There are even a few new combinations that can probably dual decode on Poulson (e.g. 4 I-type versus only 2 I-type in Tukwila).
Another advantage of the dynamic instruction buffer is that it removes some of the VLIW overhead associated with Itanium. In particular, instruction slots that cannot be used are packed with NOPs so that the bundle fits a particular format. For example, if only a single instruction can fit in a bundle (e.g. due to dependencies), then the remaining two slots will be filled with NOPs. Research has demonstrated ~20% NOP density for code in SPEC_cpu. This percentage is likely to be higher for branchy and unpredictable integer code that is typical of many server applications. When Poulson’s instruction buffer receives bundles, only the productive instructions are placed into queues; the NOPs are simply ignored. While these NOPs still waste front-end bandwidth, they are effectively removed from the pipeline and no longer consume any resources in the back-end, improving performance and saving power.
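The effect of NOP elimination can be shown with a trivial filter. This is a hypothetical sketch – the `strip_nops` helper is invented for illustration, not an actual Poulson structure: bundles arrive padded to three slots, but only productive instructions are queued.

```python
def strip_nops(bundles):
    """Keep only productive instructions for the back-end queues;
    NOP padding slots are discarded and consume no back-end resources."""
    return [slot for bundle in bundles for slot in bundle if slot != "NOP"]

# Example: two bundles, each padded to three slots with NOPs.
bundles = [("ld8", "add", "NOP"), ("cmp", "NOP", "NOP")]
productive = strip_nops(bundles)
nop_density = 1 - len(productive) / (3 * len(bundles))  # fraction of wasted slots
```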
Each queue is responsible for scheduling and issuing instructions in program order. However, the five instruction queues can issue out-of-order with respect to each other, so that delays in one queue do not impact another. A given queue always issues its oldest instructions first – so, for example, the ALU queue could issue instructions that are many bundles ahead of the oldest memory instructions. A full out-of-order mechanism could issue any instruction that is ready, whereas Poulson must issue from the head of each queue. The control queue is responsible for maintaining the original program order and dependencies between bundles.
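This per-queue in-order, cross-queue out-of-order policy can be sketched as a toy model; the `issue_cycle` function and the `ready` predicate are assumptions for illustration only.

```python
from collections import deque

def issue_cycle(queues, ready):
    """One issue cycle: each queue may issue only its oldest entry
    (in-order within a queue), but the queues advance independently,
    so a blocked head in one queue does not hold up the others."""
    issued = []
    for q in queues.values():
        if q and ready(q[0]):
            issued.append(q.popleft())
    return issued

# The memory queue's head is not ready (e.g. a cache miss), but the
# ALU queue keeps issuing, running ahead of the stalled memory queue.
queues = {"M": deque(["ld_miss", "ld2"]), "A": deque(["add1", "add2"])}
not_ready = {"ld_miss"}
first = issue_cycle(queues, lambda i: i not in not_ready)   # ["add1"]
second = issue_cycle(queues, lambda i: i not in not_ready)  # ["add2"]
```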
Instead of relying on a scoreboard to resolve dependencies ahead of execution, Poulson is far more dynamic and flexible. As instructions are issued from the queues, any hazards or complications (e.g. cache misses or register write conflicts) will simply replay the offending instruction back into the queue. Replayed instructions wait until they are ready and then are re-issued – thus avoiding repeated replays as on Intel's Pentium 4. The distributed replay architecture can handle instruction commit, exceptions and stalls faster than Tukwila's centralized pipeline control. Replay also saves considerable power by enabling forward progress in other pipelines even during cache misses or other stalls.