Haswell Out-of-Order Scheduling
Haswell’s out-of-order execution is where the microarchitecture becomes quite interesting and many changes are visible. Haswell is substantially wider than Sandy Bridge with more resources for dynamic scheduling, although the overall design is fairly similar, as shown in Figure 2.
The first part of out-of-order execution is renaming. The renamer maps architectural source and destination x86 registers onto the underlying physical register files (PRFs) and allocates other resources, such as load, store and branch buffer entries and scheduler entries. Lastly, uops are bound to particular ports for downstream execution.
The renamer can take 4 fused uops out of the uop queue for a single thread, allocating the appropriate resources and renaming registers to eliminate false dependencies. Crucially, these 4 fused uops can map to more than just 4 execution pipelines. For example, 4 fused load+execute uops might map to 8 actual uops, 4 loads and 4 dependent ALU operations.
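The arithmetic behind that example can be sketched with a toy model. The function name, the uop kind labels and the lookup table are hypothetical and exist only for illustration; the rename width of 4 and the load+execute expansion come from the text.

```python
# Toy model: how many un-fused uops one rename group expands to.
# The kind names and this mapping are illustrative assumptions.
UNFUSED_PER_FUSED = {
    "alu": 1,       # a plain ALU uop is not fused
    "load": 1,      # a plain load
    "load+alu": 2,  # fused load+execute: one load plus one dependent ALU op
}

RENAME_WIDTH = 4  # fused uops taken from the uop queue per cycle (from the text)


def unfused_count(fused_uops):
    """Return the number of un-fused uops for one rename group."""
    if len(fused_uops) > RENAME_WIDTH:
        raise ValueError("the renamer takes at most 4 fused uops per cycle")
    return sum(UNFUSED_PER_FUSED[kind] for kind in fused_uops)


print(unfused_count(["load+alu"] * 4))  # 8
```

A mixed group such as one plain ALU uop and one fused load+ALU uop would expand to 3 un-fused uops under the same assumptions.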
Unlike in Sandy Bridge, the renamer in Haswell and Ivy Bridge does not have to handle all register-to-register move uops. The front-end was enhanced to handle certain register move uops, which saves resources in the actual out-of-order execution by removing these uops altogether.
The Haswell and Sandy Bridge cores have unified integer and vector renaming, scheduling and execution resources. Most out-of-order resources are partitioned between two active threads, so that if one thread stalls, the other can continue to make substantial forward progress. In contrast, AMD’s Bulldozer splits the vector and integer pipelines. Each Bulldozer module includes two complete integer cores, but shares a single large vector core between the two. Conceptually, Bulldozer is dynamically sharing the floating point and vector unit, while having dedicated integer cores.
The most performance critical resources in Haswell have all been expanded. The ROB contains status information about uops and has grown from 168 uops to 192, increasing the out-of-order window by around 15%. Each fused uop occupies a single ROB entry, so the Haswell scheduling window is effectively over 300 operations considering fused loads and stores. The ROB is statically split between two threads, whereas other structures are dynamically shared.
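The window figures above can be checked with a few lines of arithmetic. This is a minimal sketch: the helper name and the idea of counting each fused entry as two operations are assumptions made for illustration, while the ROB sizes come from the text.

```python
SANDY_BRIDGE_ROB = 168  # ROB entries, from the text
HASWELL_ROB = 192       # ROB entries, from the text

growth = (HASWELL_ROB - SANDY_BRIDGE_ROB) / SANDY_BRIDGE_ROB
print(f"ROB growth: {growth:.1%}")  # ROB growth: 14.3%


def effective_window(rob_entries, fused_entries):
    """Operations in flight if `fused_entries` of the ROB entries each
    hold a fused pair (assumption: a fused entry counts as two operations)."""
    if not 0 <= fused_entries <= rob_entries:
        raise ValueError("fused_entries must be between 0 and rob_entries")
    return rob_entries + fused_entries


print(effective_window(HASWELL_ROB, HASWELL_ROB))  # 384, every entry fused
print(effective_window(HASWELL_ROB, 109))          # 301, just over 300
```

Under this model the window exceeds 300 operations once at least 109 of the 192 entries hold fused uops, and tops out at 384.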
The physical register files hold the actual input and output operands for uops. The integer PRF added a modest 8 registers, bringing the total to 168. Given that AVX2 was a major change to Haswell, it should be no surprise that the number of 256-bit AVX registers grew substantially to accommodate the new integer SIMD instructions. Haswell features 24 extra physical registers for renaming YMM and XMM architectural registers. The branch order buffer, which is used to roll back to a known good architectural state in the case of a misprediction, is still 48 entries, as with Sandy Bridge. The load and store buffers, which are necessary for any memory accesses, have grown by 8 and 6 entries respectively, bringing the total to 72 loads and 42 stores in-flight.
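The buffer sizes quoted so far can be collected into a small table, with the Sandy Bridge values derived as the Haswell total minus the stated increase. This is only a sketch of that subtraction; the dictionary layout and function name are hypothetical, and the vector PRF is left out because the text gives its increase (24) but not its total.

```python
# (Haswell total, increase over Sandy Bridge), figures taken from the text.
HASWELL_RESOURCES = {
    "ROB": (192, 24),
    "integer PRF": (168, 8),
    "branch order buffer": (48, 0),
    "load buffer": (72, 8),
    "store buffer": (42, 6),
}


def sandy_bridge_sizes(resources):
    """Derive the previous-generation size of each structure."""
    return {name: total - delta for name, (total, delta) in resources.items()}


previous = sandy_bridge_sizes(HASWELL_RESOURCES)
for name, (total, delta) in HASWELL_RESOURCES.items():
    print(f"{name:20s} {previous[name]:4d} -> {total:4d} (+{delta})")
```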
Unlike AMD’s Bulldozer, Haswell continues to use a unified scheduler that holds all different types of uops. The scheduler in Haswell is now 60 entries, up from 54 in Sandy Bridge, and those entries are dynamically shared between the active threads. The scheduler holds uops that are waiting to execute due to resource or operand constraints. Once ready, uops are issued to the execution units through dispatch ports. While a fused uop occupies a single entry in the ROB, each execution port can handle only a single un-fused uop, so a fused load+ALU uop will occupy two ports to execute.
Haswell and Sandy Bridge both retire up to 4 fused uops/cycle, once all the constituent uops have been successfully executed. Retirement occurs in-order and clears out resources such as the ROB, physical register files and branch order buffer.
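In-order retirement can be illustrated with a toy ROB. This is a simplified sketch, not a description of the real hardware: the `[name, done]` entry layout and the function name are assumptions, and only the retire width of 4 and the in-order rule come from the text.

```python
from collections import deque

RETIRE_WIDTH = 4  # fused uops retired per cycle, per the text


def retire_cycle(rob):
    """Retire up to RETIRE_WIDTH completed entries from the head of the ROB.

    `rob` is a deque of [name, done] pairs in program order. Retirement
    stops at the first entry that has not finished executing, even if
    younger entries behind it are already done.
    """
    retired = []
    while rob and len(retired) < RETIRE_WIDTH and rob[0][1]:
        name, _done = rob.popleft()
        retired.append(name)
    return retired


rob = deque([["a", True], ["b", False], ["c", True], ["d", True]])
first = retire_cycle(rob)    # only "a": "b" blocks everything younger
rob[0][1] = True             # "b" finishes executing
second = retire_cycle(rob)   # "b", "c" and "d" retire together
print(first, second)
```

The example shows why a single slow uop at the head of the ROB holds back younger uops that have already executed.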
Two of the new instruction set extensions place new burdens on the out-of-order machine. TSX creates a new class of potential pipeline flushes. As we predicted in the earlier article on Haswell’s TM, when a transaction aborts, the pipeline is cleared and the architectural state is rolled back. This looks largely similar to a branch misprediction.
TSX also supports multiple nested transactions, which requires hardware resources to track and disambiguate different levels of nesting. Specifically, Haswell can have 7 nested transactions in flight. Any subsequent transactions will abort as soon as they begin. It is important to note that this limit is microarchitectural and may increase in future generations. Presumably this is linked to some hardware structure with 7 speculative entries and a final entry for the rollback state.
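The nesting limit can be modelled with a simple depth counter. This is a behavioural sketch only: the class and method names are hypothetical, and resetting the depth to zero on abort assumes flat nesting, where an abort at any level rolls back to the outermost transaction. The limit of 7 comes from the text.

```python
MAX_NEST_DEPTH = 7  # Haswell's microarchitectural limit, per the text


class NestingTracker:
    """Toy model of transaction nesting depth (not real TSX code)."""

    def __init__(self):
        self.depth = 0

    def begin(self):
        """Start a (possibly nested) transaction; return False on abort."""
        if self.depth >= MAX_NEST_DEPTH:
            # Assumption: flat nesting, so the abort discards every level.
            self.depth = 0
            return False
        self.depth += 1
        return True

    def end(self):
        """Close the innermost level; return True when the outermost commits."""
        if self.depth == 0:
            raise RuntimeError("end() called outside a transaction")
        self.depth -= 1
        return self.depth == 0


tracker = NestingTracker()
results = [tracker.begin() for _ in range(8)]
print(results)  # seven True values, then False for the eighth begin
```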
Gather instructions are microcoded and introduce additional complexity to the microarchitecture. As is to be expected, Intel’s approach to gather emphasizes simplicity and correctness. This leaves performance on the table relative to a more sophisticated implementation, but is less likely to result in project delays due to risky design choices.
The number of uops executed by a gather instruction depends on the number of elements. Each element is fetched by a load uop that consumes a load buffer entry, while ALU uops calculate the vector addressing and merge the gathered data into a single register. The uops that execute a gather have full access to the hardware and semantics that are quite different from conventional x86 load and ALU instructions, so the implementation of gather is more efficient than what a programmer could create.
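The element-by-element behaviour described above can be sketched as a scalar model. This is an illustration of gather semantics, not of the microcode: the function name, the use of a Python list as memory, and element-granular addressing (no scale factor) are all assumptions. Whether masked-off elements still issue a load uop is not stated in the text, so the model only counts loads it actually performs.

```python
def gather(memory, base, indices, mask, dest):
    """Scalar model of a masked gather.

    For each element i with mask[i] set, load memory[base + indices[i]]
    and merge it into the result; otherwise keep dest[i].
    Returns (result, loads_performed).
    """
    result = list(dest)
    loads = 0
    for i, (index, enabled) in enumerate(zip(indices, mask)):
        if enabled:
            result[i] = memory[base + index]  # one load per gathered element
            loads += 1
    return result, loads


memory = list(range(100, 116))  # memory[k] == 100 + k
values, loads = gather(
    memory,
    base=2,
    indices=[0, 3, 5, 1],
    mask=[True, True, False, True],
    dest=[0, 0, 0, 0],
)
print(values, loads)  # [102, 105, 0, 103] 3
```

The loop makes the cost structure visible: the work scales with the number of elements, which matches the claim that the uop count depends on element count.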