Haswell Execution Units
The execution units in Haswell are tremendously improved over Sandy Bridge, with far more changes than the out-of-order execution mechanism. Intel’s microarchitectures from Merom through Sandy Bridge could all dispatch 6 uops/cycle to the execution units. Three dispatch ports were dedicated to arithmetic operations and three ports to memory accesses. One of the most significant improvements in Haswell is the addition of a new integer dispatch port and a new memory port, bringing peak dispatch to 8 uops/cycle. Furthermore, the execution units that are attached to each dispatch port have been rebalanced and augmented with single cycle 256-bit integer SIMD execution. Last, with the new FMA instructions, Haswell can perform 16 double precision or 32 single precision FLOP/cycle, plus a 256-bit shuffle and a basic integer ALU operation. The theoretical peak performance for Haswell is over double that of Sandy Bridge.
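The peak FLOP figures follow from simple arithmetic on the vector width and the number of FMA ports. The sketch below reproduces them in Python. The function names are illustrative, and the Sandy Bridge figure rests on an assumption not spelled out in this section: one 256-bit FMUL port plus one 256-bit FADD port, with no FMA.

```python
# Peak floating point throughput per core, per cycle (back-of-envelope model).
VECTOR_BITS = 256  # width of an AVX register


def lanes(element_bits):
    """Number of elements that fit in one 256-bit vector."""
    return VECTOR_BITS // element_bits


def haswell_peak_flops(element_bits, fma_ports=2):
    """Two FMA ports (0 and 1); each lane does a multiply and an add = 2 FLOP."""
    return fma_ports * lanes(element_bits) * 2


def sandy_bridge_peak_flops(element_bits):
    """Assumption: one FMUL port and one FADD port, each doing 1 FLOP per lane."""
    return 2 * lanes(element_bits) * 1


print(haswell_peak_flops(64))       # 16 double precision FLOP/cycle
print(haswell_peak_flops(32))       # 32 single precision FLOP/cycle
print(sandy_bridge_peak_flops(64))  # 8 double precision FLOP/cycle
```

Under these assumptions the FP peak exactly doubles. The "over double" claim also counts the extra integer port and the wider integer SIMD units, which this model ignores.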
Every cycle, the 8 oldest, non-conflicting uops that are ready for execution are sent from the unified scheduler to the dispatch ports. As shown in Figure 3, computational uops are dispatched to ports 0, 1, 5 and 6 and executed on the associated execution units. The execution units are arranged into three stacks: integer, SIMD integer and FP (both scalar and SIMD). Note that Figure 3 does not show every execution unit, due to space limitations.
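The oldest-first selection described above can be illustrated with a toy model. The Python sketch below is not Intel's scheduler. The `Uop` class, the greedy lowest-numbered-port policy and the example port bindings are all assumptions made for illustration; in particular, placing loads on ports 2 and 3 is not stated in the text.

```python
from dataclasses import dataclass


@dataclass
class Uop:
    name: str
    age: int             # lower value = older uop
    ports: frozenset     # dispatch ports able to execute this uop
    ready: bool = True   # True once all source operands are available


def dispatch_one_cycle(window):
    """Pick ready uops oldest-first; each port accepts at most one uop per cycle."""
    busy = set()
    issued = []
    for uop in sorted(window, key=lambda u: u.age):
        if not uop.ready:
            continue
        free = sorted(uop.ports - busy)
        if free:  # otherwise the uop conflicts and waits for a later cycle
            busy.add(free[0])
            issued.append((uop.name, free[0]))
    return issued


# Hypothetical scheduler contents for one cycle.
example_window = [
    Uop("add", 0, frozenset({0, 1, 5, 6})),
    Uop("shuffle", 1, frozenset({5})),
    Uop("branch", 2, frozenset({0, 6})),
    Uop("load", 3, frozenset({2, 3})),          # assumed load ports
    Uop("stalled", 4, frozenset({0}), ready=False),
]

print(dispatch_one_cycle(example_window))
# [('add', 0), ('shuffle', 5), ('branch', 6), ('load', 2)]
```

With one uop per port and eight ports, the model can never issue more than 8 uops in a cycle, matching the limit in the text.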
The three stacks do not need to be tightly coupled because they operate independently of each other. It is quite unusual that an FP operation would need a scalar integer input operand. Each stack has different data types, different result forwarding networks and potentially different registers. The data path for accessing registers and forwarding results only connects within a given stack (and register file). The data path consists of a huge number of wires (256 wires to send 128-bits of data) and the power and area savings of this arrangement are quite significant. The downside is that forwarding a result from one stack to another may incur an extra cycle of latency. The load and store units on ports 2-4 and 7 sit on the integer bypass network, to reduce latency for forwarding and access to the general purpose registers (GPRs).
The new port 6 on Haswell is a scalar integer port. It only accesses the GPRs and the integer bypass network. The execution unit can handle standard ALU operations, and contains a shifter and branch unit that were previously on port 5. One of the advantages of the new integer port is that it can handle many instructions while the SIMD dispatch ports are fully utilized. For example, when the Sandy Bridge core executes a tight loop, branches dispatch on port 5, which prevents a shuffle or blend from dispatching in the same cycle. With Haswell, this is no longer a problem.
The port 0 integer units carry over an ALU and shifter from Sandy Bridge, losing a fast LEA unit that moved to port 5 and gaining a brand new branch unit. The integer units on port 1 are unchanged, with an ALU, fast and slow LEA, and integer MUL. Port 5 has significantly changed and includes an ALU and a fast LEA unit, while losing the shift and branch units. Overall, the number of integer ALUs and branch units increased by 1, doubling the throughput for branches.
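The integer port changes are easier to check when written down as a table. The Python sketch below encodes the units per port as described in the last two paragraphs. The Sandy Bridge layout is reconstructed from the stated differences rather than taken from a separate source, and the unit names are illustrative labels.

```python
# Scalar integer execution units per dispatch port, per the description above.
SANDY_BRIDGE = {
    0: ["alu", "shift", "fast_lea"],
    1: ["alu", "fast_lea", "slow_lea", "mul"],
    5: ["alu", "shift", "branch"],
}

HASWELL = {
    0: ["alu", "shift", "branch"],               # lost fast LEA, gained branch
    1: ["alu", "fast_lea", "slow_lea", "mul"],   # unchanged
    5: ["alu", "fast_lea"],                      # lost shift and branch
    6: ["alu", "shift", "branch"],               # new scalar integer port
}


def count_units(port_map, unit):
    """Count how many ports offer a given kind of execution unit."""
    return sum(units.count(unit) for units in port_map.values())


for unit in ("alu", "branch", "shift", "fast_lea"):
    print(unit, count_units(SANDY_BRIDGE, unit), "->", count_units(HASWELL, unit))
```

The counts come out as 3 to 4 ALUs and 1 to 2 branch units, with the shifter and fast LEA counts unchanged at 2 each, which agrees with the summary in the text.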
The Sandy Bridge bypass networks are 64-bit for integer, 128-bit for SIMD integer and 128-bit for FPU. Crucially, a port could forward a result to each of the three bypass networks every cycle. To forward full 256-bit AVX results, the FPU and SIMD networks were used simultaneously to send the lower and upper 128-bits. To support the FP-only AVX instructions and uops, Sandy Bridge merely added FP execution capabilities to the 128-bit SIMD stacks, without adding any expensive wiring for the bypass network.
AVX2 is largely an integer SIMD extension, so Intel’s architects applied the same conceptual technique to achieve single cycle 256-bit SIMD integer execution. Haswell adds 128-bit integer SIMD execution capabilities to the existing 128-bit FP stack, while re-using the FP bypass network to forward half of the 256-bit results; again saving area and power. This means that Figure 3 is more representative of the logical view of the execution units, rather than the underlying physical implementation. In essence, the FP and SIMD stacks are now identical, forming the lower and upper halves of the data path. Both stacks are 128-bits wide and have the same FP and SIMD execution units.
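Splitting a 256-bit operation across two 128-bit stacks works for element-wise operations because no lane depends on any other lane. The Python sketch below shows this for a 32-bit integer add. It is a conceptual model only, with hypothetical function names, and says nothing about how the hardware is actually wired.

```python
LANE_BITS = 32
LANE_MASK = (1 << LANE_BITS) - 1


def add_lanes(a, b):
    """Element-wise 32-bit add with wraparound on two equal-length lane lists."""
    return [(x + y) & LANE_MASK for x, y in zip(a, b)]


def add_256_as_two_halves(a, b):
    """Add two 8-lane (256-bit) vectors as two independent 4-lane (128-bit) adds.

    The lower and upper halves stand in for the two 128-bit stacks, each
    producing and forwarding its own half of the result.
    """
    assert len(a) == len(b) == 8
    low = add_lanes(a[:4], b[:4])    # lower 128 bits
    high = add_lanes(a[4:], b[4:])   # upper 128 bits
    return low + high


a = [0xFFFFFFFF, 1, 2, 3, 4, 5, 6, 7]
b = [1] * 8
print(add_256_as_two_halves(a, b))  # [0, 2, 3, 4, 5, 6, 7, 8]
```

Operations that move data between the two halves do not decompose this way, which is one reason the shuffle units get separate treatment.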
Turning to the SIMD execution units, Haswell boasts a wide variety of improvements for AVX2 that roughly double the throughput. The SIMD ALU, multiplier and shifter on port 0 are all 256 bits wide, while the divider, string and AES unit remain at 128-bits (these latter units are not shown in Figure 3). Port 1 extended two execution units to 256-bit SIMD, specifically the vector ALU and blend unit (the blend unit is not shown for Sandy Bridge due to space considerations). The shuffle unit on port 1 in Sandy Bridge was removed from Haswell. However, it was not a full shuffle unit and executed only a limited subset of shuffles, which mitigates the performance impact of this change. The ALU, shuffle and blend units on port 5 were all extended to 256-bits, with no other changes (the vector blend is not shown for Sandy Bridge or Haswell). Generally, the SIMD integer performance doubled due to wider 256-bit AVX2 instructions, although the throughput for certain shuffles is the same and may decrease in some corner cases (e.g. 128-bit shuffles that could issue on port 1 and 5 in Sandy Bridge).
The new FMA instructions also required significant changes for the floating point units in Haswell. Intel’s architects opted for a fully pipelined fused multiply-add that keeps the latency at 5 cycles, the same as FMUL, so that the extra add in the FMA is essentially free from a latency perspective.
The port 0 FPU gained a fully pipelined 256-bit FMA unit, and can still be used for multiply, blend and divide uops. Note that the 128-bit divider on port 0 is not fully pipelined and is shared by all uops (integer, SIMD integer and FP). Port 1 added a 256-bit fully pipelined FMA unit that can also be used for 256-bit multiplies and a 256-bit blend unit, while retaining the existing FP add and conversion units (the FP blend and conversion units are not shown in Figure 3). Lastly, the floating point units on port 5 are unchanged from Sandy Bridge, with a 256-bit shuffle and blend.
One interesting point is that while Haswell can execute two 256-bit FMAs or FMULs per cycle, there is still only a single 256-bit FADD. The motivation for this is primarily latency. Based on extensive performance modeling, Intel’s architects found that most workloads were more heavily dependent on FADD latency, rather than bandwidth. Simulations showed that two FADD units with higher latency were inferior to a single FADD unit with 3 cycle latency on port 1. Of course, clever assembly programmers can use the FMA unit on port 0 to perform a 5 cycle FADD, although the code scheduling would be rather complex. For recompiled code, the floating point performance has basically doubled by virtue of the FMA instructions, yielding 16 DP FLOP/cycle for each Haswell core. Existing code that depends on FMUL throughput will also get substantially faster.
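The latency versus bandwidth trade-off can be made concrete with a simple cycle-count model. The Python sketch below compares a single 3 cycle FADD unit against two slower units. The 5 cycle figure for the two-unit alternative is an assumption borrowed from the FMA latency, since the text does not say what latency the rejected design would have had, and the model ignores every other pipeline effect.

```python
import math


def chain_cycles(n_ops, latency):
    """Cycles for n_ops adds where each one needs the previous result.

    Extra units do not help: only one add can be in flight at a time.
    """
    return n_ops * latency


def independent_cycles(n_ops, latency, units):
    """Cycles for n_ops independent adds on fully pipelined units."""
    if n_ops == 0:
        return 0
    return math.ceil(n_ops / units) - 1 + latency


N = 100
# Dependent chain, e.g. a running sum through a single accumulator.
print(chain_cycles(N, latency=3))                 # 300: one 3 cycle FADD
print(chain_cycles(N, latency=5))                 # 500: assumed 5 cycle FADDs
# Fully independent adds.
print(independent_cycles(N, latency=3, units=1))  # 102
print(independent_cycles(N, latency=5, units=2))  # 54
```

In this model the two slower units only win when the adds are independent. Code dominated by dependency chains runs about 1.7x slower on them, which is consistent with the finding that most workloads favor the lower latency design.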