Sandy Bridge’s out-of-order execution is one of the areas where the marriage of the P6 and P4 microarchitectures can be seen most clearly. Like the P6, Sandy Bridge generally treats integer, floating point and SIMD uops in a unified manner. This is a direct contrast to AMD’s various microarchitectures, where floating point and SIMD are handled by a co-processor (Bulldozer) or a separate cluster. However, Sandy Bridge uses fundamentally different, and more efficient, techniques for tracking and renaming the uops in flight than P6 derivatives such as Nehalem. Intel’s architects borrowed and adapted concepts first used in the P4 microarchitecture, and simultaneously expanded the out-of-order resources substantially.
Figure 4 – Sandy Bridge Out-of-Order Execution and Comparison
As shown in Figure 4, Sandy Bridge uses a physical register file (PRF) based renamer. Intel first used this approach for the P4, and AMD has also adopted PRFs for Bulldozer and Bobcat. Interestingly, almost all high performance out-of-order designs now use the same approach; IBM uses PRFs for its POWERx line as well.
As described in our Bulldozer article, the earlier P6 design (up to Nehalem and Westmere) used a Re-Order Buffer (ROB) that contained both data and status bits for uops in flight and a separate Retirement Register File (RRF). Retired uops would write their results from the ROB into the RRF. In a PRF design, all data is held in the PRF, a separate structure holds status information, and the logical registers (both speculative and architectural state) are mapped into the PRF. Retirement is handled by simply changing the mapping so that an architectural register points to the correct value in the PRF, rather than moving data.
For Sandy Bridge (and the P4), Intel still uses the term ROB. But it is critical to understand that, in this context, it only refers to the status array for in-flight uops (in Bulldozer, this status array is called the Retirement Queue). The two PRFs contain all the data values (both renamed and architectural). A third structure, the Register Alias Table, maintains the mapping of logical registers (i.e. most recent speculative and also architectural state) to the underlying physical registers in the PRF.
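To make the division of labor between the three structures concrete, here is a minimal Python sketch of a PRF-based renamer. It is a toy model under stated assumptions, not Intel's implementation: the class name `PRFRenamer`, the use of two alias tables (speculative and retired), the FIFO free list and the dictionary-based ROB entries are all invented for illustration. The point it shows is that retirement only updates a mapping and recycles the old physical register, while the data itself never moves.

```python
class PRFRenamer:
    """Toy PRF-based renamer (illustrative assumptions, not Intel's design)."""

    def __init__(self, logical_regs, num_physical):
        assert num_physical > len(logical_regs)
        self.prf = [0] * num_physical  # every data value lives here
        # Alias tables map a logical register name to a physical index.
        self.spec_rat = {r: i for i, r in enumerate(logical_regs)}
        self.arch_rat = dict(self.spec_rat)
        self.free_list = list(range(len(logical_regs), num_physical))
        self.rob = []  # status only, kept in program order; holds no data

    def rename(self, dest, srcs):
        """Look up sources, then give dest a fresh physical register."""
        src_phys = [self.spec_rat[s] for s in srcs]
        new_phys = self.free_list.pop(0)
        self.spec_rat[dest] = new_phys
        entry = {"dest": dest, "phys": new_phys, "done": False}
        self.rob.append(entry)
        return src_phys, entry

    def writeback(self, entry, value):
        """Execution writes the result straight into the PRF."""
        self.prf[entry["phys"]] = value
        entry["done"] = True

    def retire(self):
        """Retire completed uops in order by repointing the mapping."""
        retired = 0
        while self.rob and self.rob[0]["done"]:
            entry = self.rob.pop(0)
            old_phys = self.arch_rat[entry["dest"]]
            self.arch_rat[entry["dest"]] = entry["phys"]  # no data copy
            self.free_list.append(old_phys)  # old value is now dead
            retired += 1
        return retired

    def arch_value(self, reg):
        return self.prf[self.arch_rat[reg]]
```

In this sketch, two back-to-back writes to the same logical register receive different physical registers, and the younger one cannot retire until the older one has completed, even if its result is already sitting in the PRF.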
Moving to a PRF based microarchitecture creates a level of indirection and changes many of the critical steps in the out-of-order execution. Allocation, renaming, scheduling and retiring are all different for Sandy Bridge, and minimize the movement and replication of data within the processor. Of course, indirection is not free and probably adds an extra pipeline stage of latency, but the benefits are worth the trade-off.
PRF-based renaming and scheduling is substantially more power efficient because it eliminates movement of 32, 64, 128 or 256-bit data values and instead relies on pointers and changing the vastly smaller mapping tables. Intel’s architects used these power savings to substantially enlarge the out-of-order resources in the core and increase per-core performance. Sandy Bridge’s ROB was increased to sustain 168 uops in-flight, up from 128 in Nehalem. The number of rename registers more than doubled, partially because some uops may require multiple rename registers (e.g. a register-memory uop may need an integer rename register for the load address and a separate register for the result). Sandy Bridge’s integer PRF holds 160 distinct 64-bit values, while the FP PRF has 144 entries and is 256 bits wide to accommodate the full YMM registers in AVX. The scheduler expanded to hold 54 uops and the memory ordering buffers also substantially increased in size.
When using AVX, the architectural state of an x86 thread increases substantially. The YMM registers hold 256B more data than the XMM registers, which creates extra work when executing an XSAVE or XRSTOR instruction for a context or VM switch. In Sandy Bridge, the hardware tracks which registers are used during execution and will selectively save and restore only the registers that are actually used. Given that the basic register state for x86-64 is slightly over 700B (including 64-bit GPRs, x87 and YMM), this can be quite helpful. For example, a thread only using the 16 GPRs would eliminate moving roughly 600B of data on a context switch.
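The figures above can be checked with a few lines of arithmetic. The sketch below counts only the data registers named in the text (sixteen 64-bit GPRs, eight 80-bit x87 registers and sixteen 256-bit YMM registers) and deliberately ignores flags, control and status words, and any padding in the actual save area, so it is an approximation of the state rather than the size of a real save image.

```python
# Register counts and widths in bytes; control/status state is ignored.
GPR_BYTES = 16 * 8    # sixteen 64-bit general purpose registers
X87_BYTES = 8 * 10    # eight 80-bit x87 registers
XMM_BYTES = 16 * 16   # sixteen 128-bit XMM registers
YMM_BYTES = 16 * 32   # sixteen 256-bit YMM registers (XMM is the low half)

total_state = GPR_BYTES + X87_BYTES + YMM_BYTES
ymm_extra = YMM_BYTES - XMM_BYTES             # extra state added by AVX
saved_if_gpr_only = total_state - GPR_BYTES   # bytes skipped by a GPR-only thread

print(total_state, ymm_extra, saved_if_gpr_only)  # 720 256 592
```

The total of 720B matches "slightly over 700B", the 256B difference matches the extra YMM state, and 592B is the "roughly 600B" a GPR-only thread avoids moving.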
Another small benefit of a PRF-style design is zeroing registers. A common idiom in x86 code is clearing a register by XORing it with itself, to break any dependencies. In previous generations, this required a uop in one of the ALUs to overwrite the register. However, with a PRF-based design, zeroing a register can be accomplished solely within the renamer – by simply adding the register to the list of freely available registers in the PRF. This trick may also be used with VZEROALL instructions in AVX.
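As a rough illustration of how a renamer could absorb the zeroing idiom, consider the sketch below. It rests on an assumption the text does not make: a hypothetical physical register (`ZERO_PHYS`) that always reads as zero, to which the destination is simply remapped. The function name `rename_uop` and its arguments are likewise invented, and the reclaiming of the previous physical register at retirement is not modeled. What matters is the outcome, which is that the idiom consumes no ALU and carries no dependency on the old register value.

```python
ZERO_PHYS = 0  # hypothetical physical register that always reads as zero


def rename_uop(op, dest, srcs, rat, free_list):
    """Rename one uop; return (physical_dest, needs_execution).

    rat maps logical register names to physical indices and is updated
    in place. free_list holds unused physical indices.
    """
    if op == "xor" and len(srcs) == 2 and srcs[0] == srcs[1]:
        # Zeroing idiom: the result is known without reading the source,
        # so remap the destination and skip the execution units entirely.
        rat[dest] = ZERO_PHYS
        return ZERO_PHYS, False
    phys = free_list.pop(0)
    rat[dest] = phys
    return phys, True


rat = {"eax": 3, "ebx": 2}
free_list = [4, 5]
print(rename_uop("xor", "eax", ["eax", "eax"], rat, free_list))  # (0, False)
print(rename_uop("add", "eax", ["eax", "ebx"], rat, free_list))  # (4, True)
```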
Sandy Bridge’s front-end can deliver 4 uops per cycle from one of the two threads. The uops are still processed in-order at this point, and allocate the necessary resources to track execution. Each uop will allocate a sequential entry in the ROB, to track status and completion information and maintain the correct program order. Integer uops require an available PRF entry to rename their output, while FP and SIMD uops use the FP PRF. Any uops that access memory must also allocate an entry in the appropriate load or store buffer. As previously mentioned, 256-bit AVX instructions decode into a single uop and will not consume extra ROB or PRF entries.
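The allocation rules in this paragraph can be sketched as a simple in-order resource check. This is a simplified model rather than a description of the actual allocator: the function name `allocate`, the dictionary layout and the stall-at-first-failure policy are assumptions, and every uop is assumed to write one register, which is not true of real stores. The free-entry counts are supplied by the caller, since the load and store buffer sizes are not given in this section.

```python
ALLOC_WIDTH = 4  # uops delivered by the front-end per cycle (from the text)


def allocate(uops, free):
    """Allocate resources in program order for up to ALLOC_WIDTH uops.

    Each uop is a dict with 'kind' ('int' or 'fp') and an optional 'mem'
    ('load' or 'store'). free maps resource names to free-entry counts
    and is updated in place. Allocation stops at the first uop whose
    needs cannot be met, so younger uops never pass older ones.
    Returns the number of uops allocated.
    """
    allocated = 0
    for uop in uops[:ALLOC_WIDTH]:
        needs = ["rob", uop["kind"] + "_prf"]
        if uop.get("mem"):
            needs.append(uop["mem"] + "_buf")
        if any(free[name] < 1 for name in needs):
            break  # stall: a required structure is full
        for name in needs:
            free[name] -= 1
        allocated += 1
    return allocated


# Deliberately tiny free counts so the stall is visible.
free = {"rob": 3, "int_prf": 2, "fp_prf": 1, "load_buf": 1, "store_buf": 0}
group = [
    {"kind": "int"},
    {"kind": "fp", "mem": "load"},
    {"kind": "int", "mem": "store"},  # stalls here: no store buffer entry
    {"kind": "int"},
]
print(allocate(group, free))  # 2
```

Note that the fourth uop is not allocated even though its resources are available, because allocation is still strictly in order at this stage.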
Once uops have been allocated and renamed, they can freely execute out-of-order. Sandy Bridge uses a unified scheduler that is dynamically shared between threads, like Nehalem, but with 50% more capacity (54 entries, up from 36). A unified scheduler has the advantage of allowing a more flexible mix of instructions to execute efficiently compared to distributed schedulers. Renamed uops are entered into the 54 entry unified scheduler, where they wait until they are ready to execute. When a uop is ready, it will be issued to the appropriate execution units. Like Nehalem, Sandy Bridge can issue up to 6 uops to different ports and retire 4 uops per cycle.
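A unified scheduler of this kind can be modeled in a few lines. The sketch below makes several assumptions that the text does not confirm: each uop is bound to exactly one port before it enters the scheduler, selection is oldest-first, and readiness is tracked as a set of physical register names whose values are available. The class name `UnifiedScheduler` and the uop dictionary layout are invented for the example. Only the 54 entry capacity and the limit of one uop per port per cycle (so at most 6 in total) come from the article.

```python
SCHED_ENTRIES = 54  # unified scheduler capacity (from the text)
NUM_PORTS = 6       # issue ports, so at most 6 uops issue per cycle


class UnifiedScheduler:
    """Toy unified scheduler shared by all uop types (illustrative only)."""

    def __init__(self):
        self.entries = []  # waiting uops, oldest first

    def insert(self, uop):
        """Add a renamed uop; return False if the scheduler is full."""
        if len(self.entries) >= SCHED_ENTRIES:
            return False
        assert 0 <= uop["port"] < NUM_PORTS
        self.entries.append(uop)
        return True

    def issue(self, ready_regs):
        """Issue ready uops oldest-first, at most one per port."""
        busy_ports = set()
        issued = []
        for uop in list(self.entries):
            if uop["port"] in busy_ports:
                continue  # port already claimed this cycle
            if all(src in ready_regs for src in uop["srcs"]):
                busy_ports.add(uop["port"])
                issued.append(uop)
                self.entries.remove(uop)
        return issued


sched = UnifiedScheduler()
sched.insert({"name": "a", "port": 0, "srcs": ["p1"]})
sched.insert({"name": "b", "port": 0, "srcs": []})
sched.insert({"name": "c", "port": 1, "srcs": ["p2"]})
sched.insert({"name": "d", "port": 5, "srcs": []})
print([u["name"] for u in sched.issue({"p1"})])        # ['a', 'd']
print([u["name"] for u in sched.issue({"p1", "p2"})])  # ['b', 'c']
```

In the first cycle, uop `b` is ready but loses port 0 to the older `a`, while `d` issues ahead of the older, unready `c`. That is the out-of-order behavior the scheduler exists to provide.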