Once instructions have been decoded into COPs, they are dispatched to the back-end of the CPU core for renaming, scheduling, and issuing to the execution units. The renaming, scheduling, and out-of-order execution in Jaguar and Bobcat are clearly descended from the K8 and Barcelona, but tailored for efficiency. The FP and integer clusters are renamed separately, and the schedulers are distributed.
The Jaguar and Bobcat back-ends can dispatch two COPs per cycle from the instruction queue. The first cycle of dispatching (FDEC) checks that the necessary downstream resources are available, and stalls COPs as necessary. The next pipeline stage, dispatch, performs the renaming, and the scheduling stage then writes into the scheduling queues.
Each COP allocates a Retire Control Unit (RCU) entry, which tracks exception and retirement status information. At the same time, the µops that comprise the COP must allocate the necessary resources, such as rename registers and scheduler entries. The out-of-order window for Jaguar is about 14% larger than Bobcat's, with 64 entries compared to 56. Up to 2 COPs per cycle are retired in order by the RCU, once all of their constituent µops have finished execution.
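The allocate-then-retire-in-order behavior of the RCU can be illustrated with a toy model. This is only a sketch: the class and method names are hypothetical, and the per-COP µop counter is an assumed bookkeeping scheme, not a description of the actual hardware. The 64-entry capacity and retire width of 2 come from the text above.

```python
from collections import deque


class RetireControlUnit:
    """Toy RCU: COPs allocate in program order and retire in program order."""

    def __init__(self, capacity=64, retire_width=2):
        self.capacity = capacity          # 64 entries for Jaguar, 56 for Bobcat
        self.retire_width = retire_width  # up to 2 COPs retired per cycle
        self.entries = deque()            # oldest COP sits at the left

    def allocate(self, cop_id, num_uops):
        """Return False when the RCU is full, meaning dispatch must stall."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append({"id": cop_id, "pending": num_uops})
        return True

    def complete_uop(self, cop_id):
        """Mark one µop of the given COP as finished (possibly out of order)."""
        for entry in self.entries:
            if entry["id"] == cop_id:
                entry["pending"] -= 1
                return
        raise KeyError(cop_id)

    def retire_cycle(self):
        """Retire up to retire_width COPs, stopping at the first unfinished one."""
        retired = []
        while (self.entries and len(retired) < self.retire_width
               and self.entries[0]["pending"] == 0):
            retired.append(self.entries.popleft()["id"])
        return retired


rcu = RetireControlUnit()
rcu.allocate("A", 1)
rcu.allocate("B", 2)
rcu.complete_uop("B")
rcu.complete_uop("B")
print(rcu.retire_cycle())  # [] because B is finished but the older A is not
rcu.complete_uop("A")
print(rcu.retire_cycle())  # ['A', 'B']
```

The younger COP finishes first but cannot retire until the older one does, which is the property that keeps exceptions precise.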
The COP is fundamentally a unit for tracking control information throughout the out-of-order pipeline. Once a COP is in the RCU, the emphasis shifts to the constituent µops. The µops form the data flow in the pipeline.
Jaguar and Bobcat use physical register file (PRF) based renaming to reduce power consumption by manipulating register pointers, whereas the K8 and Barcelona move data between architectural and speculative register files. The integer PRFs for Jaguar and Bobcat are 64 entries and hold the architectural and speculative versions of the 64-bit integer registers. Of the 64 physical registers, 20-31 are used to hold architectural and microarchitectural state, while the other 33-44 registers can hold speculative data.
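A minimal sketch of pointer-based renaming, assuming a simple free list and a single rename map, may make the idea concrete. All names here are hypothetical and the register counts in the usage lines are shrunk for readability; only the concept of remapping pointers rather than copying data is taken from the text.

```python
class RenameMap:
    """Toy PRF renamer: a write moves a pointer, it never copies register data."""

    def __init__(self, arch_regs, num_physical=64):
        # Assumption: each architectural register starts in its own physical register.
        self.map = {reg: i for i, reg in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_physical))

    def rename(self, dest, srcs):
        """Return (new physical dest, physical sources, previous physical dest)."""
        src_tags = [self.map[s] for s in srcs]
        if dest is None:
            # No register result, so no rename register is consumed.
            return None, src_tags, None
        if not self.free:
            raise RuntimeError("no free physical register: dispatch stalls")
        new_phys = self.free.pop(0)
        old_phys = self.map[dest]
        self.map[dest] = new_phys
        return new_phys, src_tags, old_phys

    def release(self, phys):
        """Return a physical register to the free list (e.g., once the overwriting op retires)."""
        self.free.append(phys)


rm = RenameMap(["rax", "rbx", "rcx", "rdx"], num_physical=8)
print(rm.rename("rax", ["rbx", "rcx"]))  # (4, [1, 2], 0)
print(rm.rename("rdx", ["rax"]))         # (5, [4], 3)
print(rm.rename(None, ["rax", "rdx"]))   # (None, [4, 5], None)
```

The second operation reads the renamed copy of rax (physical register 4), so the dependency is carried purely by the pointer.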
Instructions that do not write to registers (e.g., CMP, which only writes to the flags, or stores) do not require rename registers. After renaming, any integer µop takes a cycle to write into an integer scheduler entry. The integer scheduler contains 20 entries for Jaguar, a substantial increase over the earlier Bobcat which had a 16-entry scheduler. The scheduler is responsible for the out-of-order execution and issues the two oldest integer µops that are ready to the execution units.
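The oldest-ready-first selection described above can be sketched in a few lines. This is an illustrative model only: the tuple layout, the function name, and the idea of tracking readiness as a set of physical register tags are assumptions, not details from the source.

```python
def pick_oldest_ready(scheduler, ready_tags, width=2):
    """Issue up to `width` of the oldest µops whose sources are all ready.

    scheduler: list of (age, name, src_tags) tuples; a lower age means older.
    ready_tags: set of physical register tags whose values are available.
    """
    ready = [u for u in scheduler if all(tag in ready_tags for tag in u[2])]
    ready.sort(key=lambda u: u[0])  # oldest first
    issued = ready[:width]
    for uop in issued:
        scheduler.remove(uop)       # an issued µop frees its scheduler entry
    return [uop[1] for uop in issued]


queue = [(0, "add", (1,)), (1, "sub", (9,)), (2, "and", ()), (3, "or", (1,))]
print(pick_oldest_ready(queue, ready_tags={1}))  # ['add', 'and']
```

The "sub" µop is older than "and" but is skipped because its source is not ready, which is where the out-of-order behavior comes from.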
Floating Point Cluster
The Jaguar and Bobcat floating point cluster is similar to that of the K8, in the sense that it follows more of a coprocessor model. FP and SIMD COPs are tracked in the RCU, but the actual µops are transmitted to the FP cluster, which includes a separate FP decoder and Retire Queue for tracking details of FP exceptions. Transmitting and decoding the µops takes two cycles, with a third cycle for register renaming.
The Jaguar FP Retire Queue is 44 entries and renames SIMD and FP µops onto a pool of 72 SSE registers. The FP Retire Queue entries are released when the associated COP is retired from the RCU. The actual re-ordering window is modestly larger than Bobcat's, which offers a 40-entry FP Retire Queue and 88 rename registers. One of the most significant improvements in Jaguar is wider 128-bit SSE registers, as opposed to the 80-bit x87/MMX registers used in Bobcat. From a practical standpoint, this means that Jaguar has a larger scheduling window than its predecessor, since renaming a 128-bit SSE register requires two registers in Bobcat, but only a single register in Jaguar. One advantage of 128-bit registers is that there are no partial register write penalties for mixing AVX and SSE code.
In Jaguar, AVX instructions that operate on 256-bit data are cracked into two COPs, one per 128-bit half. Each COP of a register-register operation decodes into a single µop, whereas each COP of a register-memory operation spawns two µops. So in theory, an instruction such as VMULPD with a memory operand can spawn 4 µops: 2x128b loads and 2x128b multiplies.
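The arithmetic can be written out as a small helper. The function name is hypothetical, and treating every 128-bit operation as a single COP is an assumption made for illustration; the text only spells out the 256-bit case.

```python
def avx_uop_count(width_bits, has_memory_operand):
    """Return (COPs, µops) for one instruction under the cracking rule above."""
    # 256-bit operations are cracked into two COPs.
    # Assumption: narrower operations map to a single COP.
    cops = 2 if width_bits == 256 else 1
    # A register-memory COP needs a load µop plus the operation µop.
    uops_per_cop = 2 if has_memory_operand else 1
    return cops, cops * uops_per_cop


print(avx_uop_count(256, has_memory_operand=True))   # (2, 4)
print(avx_uop_count(256, has_memory_operand=False))  # (2, 2)
```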
To enable issuing two FP/SIMD µops per cycle, the physical register file has 4 read ports and 3 write ports, each 128 bits wide. The last write port is used for load instructions that target registers in the FP cluster. To save power, the Jaguar physical register file adopts several related techniques to aggressively free registers for use. The first technique is known as the zero-bit, which indicates in the rename table that a register only contains zeroes (e.g., for the upper half of a 256-bit YMM register, or a register that was cleared using XOR) and does not need a physical register. This approach was extended to reclaim 8 temporary FP microcode registers when the microcode is not active. Last, the x87 registers are speculatively assumed to be unused and are similarly reclaimed during ordinary operation. The 8 registers are cached in a special scratchpad memory rather than the PRF; if an x87 operation occurs while the registers are cached, a fault is raised and the registers are copied into the PRF. The zero-bit also enables the Jaguar FPU to eliminate µops that do not produce a result (e.g., XORing a register to clear it), saving scheduler entries and power.
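A rough sketch of the zero-bit idea follows, assuming a rename table whose entries are either a zero marker or a pointer to a physical register. The class and method names are hypothetical, and the old register is freed immediately at rename time for simplicity, whereas real hardware would have to wait until it is safe to do so.

```python
class ZeroBitRenamer:
    """Toy rename table where an all-zero register consumes no physical register."""

    def __init__(self, num_physical=72):
        self.free = list(range(num_physical))
        self.table = {}  # architectural register -> ("zero", None) or ("phys", index)

    def _drop(self, reg):
        # Simplification: reclaim the previous mapping right away.
        kind, phys = self.table.get(reg, ("zero", None))
        if kind == "phys":
            self.free.append(phys)

    def write_zero(self, reg):
        """Zeroing idiom: set the zero-bit, allocate nothing, and issue no µop."""
        self._drop(reg)
        self.table[reg] = ("zero", None)

    def write_value(self, reg):
        """Ordinary result: allocate a physical register and return its index."""
        self._drop(reg)
        phys = self.free.pop()
        self.table[reg] = ("phys", phys)
        return phys


r = ZeroBitRenamer(num_physical=4)
r.write_value("xmm0")
print(len(r.free))  # 3
r.write_zero("xmm0")
print(len(r.free))  # 4, the register was reclaimed
```

In this model, clearing a register both returns its physical register to the pool and avoids occupying a scheduler entry.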
Once an FP/SIMD µop is renamed, it takes an additional cycle to enter the unified 18-entry scheduler, which will issue the two oldest ready µops to the vector and FP execution units.