The Out-of-Order Engines – Renaming and Scheduling
The first thing to notice is that the out-of-order control logic is vastly more complicated for the K8 and Barcelona than for the Core 2. Unlike Intel’s microarchitecture, the K8 and Barcelona have split integer and floating point clusters, with distributed schedulers/reservation stations. The Core microarchitecture, as discussed previously, has a single execution cluster, with a unified scheduler/reservation station and multiple issue ports. These choices, which date back to the P6 and the Athlon, are one of the major factors that account for the strength of AMD microprocessors in floating point workloads.
Figure 3 – Comparison of Out-Of-Order Resources
The pack buffer, which is part of the decoding phase, is responsible for sending groups of exactly 3 micro-ops to the re-order buffer (ROB). However, the 72 instruction re-order buffer is not actually 72 independent entries. It contains 24 entries, with 3 lanes for instructions in each entry. The re-order buffer contains a rename register for the result of each operation in flight (or in the case of a FP operation, a pointer to the FP register file).
To ensure that the ROB is fully utilized, only a single group of exactly three instructions can be sent to the ROB each cycle. The function of the pack buffer is therefore twofold. First, it coalesces instructions into groups of three so that they can enter the ROB. Second, and just as importantly, it can move instructions between lanes to avoid a congested reservation station downstream or to observe issue restrictions. For example, floating point or integer multiplies must be in the first lane, while LZCNT (leading zero count) must be in the third. Each lane corresponds to a specific reservation station further down in the pipeline, and once an instruction enters a specific lane in the ROB, it cannot be moved. Thus the pack buffer is also the last chance to switch lanes for an instruction.
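The lane restrictions can be pictured with a small sketch. It is only an illustration: the function names, the string labels for instruction types, and the brute-force search are assumptions made for this example, and the real pack buffer logic is not described at this level of detail. The only restrictions modeled are the two mentioned above (multiplies in the first lane, leading zero count in the third).

```python
from itertools import permutations

# Hypothetical restriction table built from the two examples in the text:
# multiplies must sit in lane 0 (the first lane), lzcnt in lane 2 (the third).
LANE_RESTRICTIONS = {"mul": {0}, "lzcnt": {2}}
ALL_LANES = {0, 1, 2}


def allowed_lanes(kind):
    """Return the set of lanes an instruction of this kind may occupy."""
    return LANE_RESTRICTIONS.get(kind, ALL_LANES)


def pack_group(ops):
    """Try to place up to three ops into distinct lanes.

    Returns a 3-element list indexed by lane (None marks an empty lane),
    or None if no legal placement exists for this set of ops.
    """
    if len(ops) > 3:
        raise ValueError("a group holds at most three instructions")
    # Try every assignment of ops to distinct lanes and keep the first legal one.
    for lanes in permutations(range(3), len(ops)):
        if all(lane in allowed_lanes(op) for op, lane in zip(ops, lanes)):
            group = [None, None, None]
            for op, lane in zip(ops, lanes):
                group[lane] = op
            return group
    return None


print(pack_group(["add", "lzcnt", "mul"]))  # multiply moves to lane 0, lzcnt to lane 2
print(pack_group(["mul", "mul"]))           # two multiplies cannot share a group
```

In this toy model, two multiplies in the same group have no legal placement, so one of them would have to wait for a later group. How the hardware actually resolves such conflicts is not covered here.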
At this point, the paths for floating point and integer/memory instructions diverge. The next stop on the integer side is the Integer Future File and Register File (IFFRF). The IFFRF contains 40 registers broken up into three distinct sets. The first is the Architectural Register File, which contains the 16×64 bit non-speculative registers specified by the x86-64 instruction set. Instructions can only modify the Architectural Register File once they have retired, with no exceptions. Speculative instructions instead read from and write to the Future File, which contains the most recent speculative state of the 16 architectural registers. The last 8 registers are scratchpad registers used by the microcode. In the case of a branch misprediction or exception, the pipeline must roll back, and the Architectural Register File overwrites the contents of the Future File.
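The relationship between the two register sets can be sketched in a few lines. This is a simplified model under stated assumptions: the class and method names are invented for the example, register values are plain integers, and the microcode scratchpad registers are left out.

```python
class IntegerRegisterState:
    """Toy model of an architectural register file plus a future file."""

    NUM_ARCH_REGS = 16  # the 16 integer registers of x86-64

    def __init__(self):
        self.architectural = [0] * self.NUM_ARCH_REGS  # committed state only
        self.future = [0] * self.NUM_ARCH_REGS         # latest speculative state

    def speculative_write(self, reg, value):
        """An executed but not yet retired instruction updates the future file."""
        self.future[reg] = value

    def speculative_read(self, reg):
        """In-flight instructions read the most recent speculative value."""
        return self.future[reg]

    def retire(self, reg, value):
        """Only retirement may modify the architectural register file."""
        self.architectural[reg] = value

    def rollback(self):
        """On a mispredict or exception, committed state overwrites the future file."""
        self.future = list(self.architectural)


state = IntegerRegisterState()
state.speculative_write(3, 111)  # older instruction, later retired
state.retire(3, 111)
state.speculative_write(3, 222)  # younger instruction on a mispredicted path
state.rollback()
print(state.speculative_read(3))  # the wrong-path value is gone
```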
From the ROB, instructions issue to the appropriate scheduler. The integer cluster contains three reservation stations (or schedulers). Each one is tied to a specific lane in the ROB and holds 8 instructions, along with their source operands. The source operands come from either the Future File or the result forwarding bus (which is not shown because it is too complicated to draw).
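A rough sketch of one such reservation station follows. It assumes a simple tag-based wakeup scheme, which is a common textbook approach and not necessarily how AMD implements it. The class, its methods, and the string tags are all hypothetical; only the capacity of 8 comes from the description above.

```python
class ReservationStation:
    """Toy scheduler: holds instructions until all source operands are available."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = []  # each entry is [name, set of operand tags still pending]

    def insert(self, name, source_tags, available_tags):
        """Add an instruction; operands already available are captured at once.

        Returns False if the station is full and the instruction must wait.
        """
        if len(self.entries) >= self.capacity:
            return False
        pending = set(source_tags) - set(available_tags)
        self.entries.append([name, pending])
        return True

    def broadcast(self, tag):
        """A result appears on the forwarding bus; waiting entries capture it."""
        for entry in self.entries:
            entry[1].discard(tag)

    def ready(self):
        """Names of instructions whose operands have all arrived."""
        return [name for name, pending in self.entries if not pending]


rs = ReservationStation()
rs.insert("add", ["r1", "r2"], available_tags={"r1"})  # r2 is still in flight
print(rs.ready())    # nothing can issue yet
rs.broadcast("r2")   # the producer of r2 finishes
print(rs.ready())    # the add is now ready to issue
```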
The Floating Point Cluster
Floating point instructions are handled quite differently. Instead of being sent directly to the reservation stations, they first head to the FP Mapper and Renamer. One of the nasty aspects of the x86 instruction set is that x87 FP operations are stack-based. The FP mapper converts these stack operations to use a flat register file instead, so that renaming can occur.
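To make the stack-to-flat conversion concrete, here is a minimal sketch based on the architectural x87 top-of-stack pointer: ST(i) refers to register (TOP + i) mod 8, a push decrements TOP, and a pop increments it. The class and method names are invented for this example, and the internal design of AMD's mapper may well differ.

```python
class X87StackMapper:
    """Toy mapper from stack-relative ST(i) names to flat register indices."""

    NUM_REGS = 8  # the x87 register stack has eight slots

    def __init__(self):
        self.top = 0  # top-of-stack pointer, 0 after initialization

    def flat_index(self, i):
        """Flat register index that ST(i) currently refers to."""
        return (self.top + i) % self.NUM_REGS

    def push(self):
        """A load onto the stack decrements TOP; returns the new ST(0) index."""
        self.top = (self.top - 1) % self.NUM_REGS
        return self.top

    def pop(self):
        """A pop increments TOP; returns the index that was ST(0)."""
        old_top = self.top
        self.top = (self.top + 1) % self.NUM_REGS
        return old_top


mapper = X87StackMapper()
mapper.push()                # first value lands in flat register 7
mapper.push()                # second value lands in flat register 6
print(mapper.flat_index(0))  # ST(0) is the most recent push
print(mapper.flat_index(1))  # ST(1) is the one before it
```

Once every operand has a flat index, the instructions look like ordinary register-to-register operations and can be renamed in the usual way.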
In the renamer, up to 3 FP instructions each cycle are assigned a destination register from the 120-entry FP register file. The file is large enough to rename up to the maximum of 72 instructions in flight. Along with the FP register file, there are two arrays: the architectural file array and the future file array. In Barcelona, the architectural file array contains pointers to 44 of the 120 FP registers, which contain the non-speculative state: 8 for x87/MMX, 8 scratchpad registers for the microcode and 8×128 bit XMM registers. Previously, the K8 treated the XMM registers as 16×64 bit registers, but that changed once 128 bits became a ‘native’ data format. Similarly, the future file array contains pointers to 44 renamed registers that contain the latest speculative values within the FP register file.
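The pointer-based scheme can be sketched with a free list and two map tables. This is a generic renaming model under stated assumptions, not a description of AMD's circuit: the class name, the free-list policy, and the initial mapping are all invented for illustration. The default sizes are simply the figures quoted above.

```python
from collections import deque


class FPRenamer:
    """Toy renamer: two pointer arrays into one physical FP register file."""

    def __init__(self, num_physical=120, num_architectural=44):
        # Assumed starting point: architectural register i lives in physical register i.
        self.architectural_map = list(range(num_architectural))
        self.future_map = list(self.architectural_map)
        # Every remaining physical register is available for renaming.
        self.free_list = deque(range(num_architectural, num_physical))

    def rename(self, arch_reg):
        """Give an instruction's destination a fresh physical register.

        Returns the physical register number, or None if none is free.
        """
        if not self.free_list:
            return None
        phys = self.free_list.popleft()
        self.future_map[arch_reg] = phys  # latest speculative location
        return phys

    def retire(self, arch_reg, phys):
        """Commit a result: the architectural pointer moves, the old register is freed."""
        old = self.architectural_map[arch_reg]
        self.architectural_map[arch_reg] = phys
        self.free_list.append(old)


renamer = FPRenamer()
dest = renamer.rename(3)       # speculative destination for architectural register 3
print(dest, renamer.future_map[3], renamer.architectural_map[3])
renamer.retire(3, dest)        # the result becomes non-speculative
print(renamer.architectural_map[3])
```

Because both arrays hold pointers instead of data, retiring a result only updates a pointer; the value itself never moves within the register file.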
Once the micro-ops have been renamed, they may be issued to the three FP schedulers. Each reservation station holds up to 12 instructions, along with their source operands. Like the integer schedulers, each FP scheduler is tied to a specific lane in the ROB, and its operands can come from either the FP register file or the forwarding network.