The execution units in Sandy Bridge were reworked to double the FP performance for vectorizable workloads by efficiently executing 256-bit AVX instructions. Almost all 256-bit AVX instructions are decoded into and execute as a single uop – in contrast to AMD’s more cautious embrace of AVX: Bulldozer will crack 256-bit instructions into two 128-bit operations. When Intel introduced SSE2 in the P4, each 128-bit instruction was cracked into two 64-bit uops, and the throughput did not substantially improve. This created a chicken-and-egg problem: Intel wanted developers to use SSE2 (since the P4 was not designed to execute x87 particularly fast), but developers did not want to rewrite or recompile code for a marginal gain. This is one reason why it took 5-8 years for SSE2 to truly become pervasive.
Sandy Bridge can sustain a full 16 single precision FLOP/cycle or 8 double precision FLOP/cycle – double the capabilities of Nehalem. This means that software which uses AVX should actually see a substantial performance advantage on Sandy Bridge, which should spur faster adoption. Intel seems to have learned the lessons of SSE2, and hopefully the uptake of AVX amongst the software community will be far swifter.
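The peak figures quoted above follow from simple arithmetic, which can be sketched in a few lines of Python. This is only an illustration: the function name `peak_flops_per_cycle` and the `fp_pipes` parameter are hypothetical, and the sketch assumes that one vector multiply and one vector add can issue every cycle on both Nehalem (128-bit) and Sandy Bridge (256-bit).

```python
def peak_flops_per_cycle(vector_bits, element_bits, fp_pipes=2):
    """Peak FLOP/cycle = lanes per vector * FP pipes issuing each cycle.

    fp_pipes=2 is an assumption: one multiply and one add per cycle.
    """
    lanes = vector_bits // element_bits
    return lanes * fp_pipes


# Sandy Bridge with 256-bit AVX
snb_sp = peak_flops_per_cycle(256, 32)  # 8 lanes * 2 pipes
snb_dp = peak_flops_per_cycle(256, 64)  # 4 lanes * 2 pipes

# Nehalem with 128-bit SSE
nhm_sp = peak_flops_per_cycle(128, 32)  # 4 lanes * 2 pipes
nhm_dp = peak_flops_per_cycle(128, 64)  # 2 lanes * 2 pipes

print(snb_sp, snb_dp, nhm_sp, nhm_dp)
```

Under these assumptions the Sandy Bridge results are 16 and 8, exactly twice the Nehalem results of 8 and 4.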
Figure 5 – Sandy Bridge Execution Units and Comparison
As Figure 5 above indicates, Sandy Bridge can execute a 256-bit FP multiply, a 256-bit FP add and a 256-bit shuffle every cycle. However, the floating point data paths were not expanded and are still 128 bits wide; instead the SIMD integer data paths are enlisted to assist with AVX operations. Understanding this technique requires a slight detour and further exploration of the execution units.
The scheduler will issue the oldest ready-to-execute uops each cycle to the six available ports, depending on the type of uop. Ports 0, 1 and 5 are used for executing computational uops. There are three types of computational uops and execution units or execution stacks: integer, SIMD integer (which we will refer to as SIMD) and FP (either scalar or SIMD). Each of the three execution ports has hardware for the three different types of uops. For example, on Sandy Bridge the FP stack for port 1 can execute a 128-bit FP add, and the SIMD stack can execute a 128-bit SIMD add. Each of the three execution stacks is considered to be a different ‘domain’. There is free bypassing within each domain, but a 1-2 cycle penalty for bypassing between domains (e.g. an integer uop that receives a forwarded result from an FP uop). This partitioning primarily saves power and reduces the complexity of the forwarding network for rarely used cases. The memory pipeline and ports 2-4 are within the integer domain, since integer uops are the most latency sensitive.
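The cost of crossing domains can be illustrated with a toy model in Python. This is a rough sketch rather than a description of the real forwarding network: the function names are hypothetical, and the default penalty of 1 cycle is an assumption picked from the 1-2 cycle range given above.

```python
DOMAINS = ("integer", "simd", "fp")


def bypass_delay(producer, consumer, cross_domain_penalty=1):
    """Extra cycles when a result is forwarded from producer to consumer.

    cross_domain_penalty is an assumption; the text gives a 1-2 cycle range.
    """
    if producer not in DOMAINS or consumer not in DOMAINS:
        raise ValueError("unknown domain")
    # Forwarding within a domain is free; crossing domains costs extra cycles.
    return 0 if producer == consumer else cross_domain_penalty


def chain_bypass_delay(chain, cross_domain_penalty=1):
    """Total extra bypass cycles along a dependency chain of uop domains."""
    return sum(
        bypass_delay(a, b, cross_domain_penalty)
        for a, b in zip(chain, chain[1:])
    )


same_domain = chain_bypass_delay(["fp", "fp", "fp"])
mixed_domain = chain_bypass_delay(["fp", "simd", "fp"])
print(same_domain, mixed_domain)
```

In this model a dependency chain that stays in the FP domain pays nothing extra, while one that bounces through the SIMD domain and back pays the penalty twice.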
Instead of widening the data paths to 256 bits, the Sandy Bridge architects moved the integer SIMD stacks to slightly different issue ports and cleverly re-used the existing 128-bit SIMD and 128-bit FP data paths, ganged together, to execute 256-bit uops. For example, a 256-bit multiply can issue to port 0 and simultaneously use the 128-bit SIMD data path for the low half and the 128-bit FP data path for the high half. This technique requires some extra logic, but it saves substantial area and power by re-using execution resources that are already present. The 256-bit shuffle on port 5 also requires dedicated hardware for crossing between the two 128-bit lanes. Fortunately, all the extra logic to re-use the SIMD execution units is relatively small and power efficient compared to the area and leakage necessary to double the data paths.
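The ganging of two 128-bit data paths can be pictured with a purely functional model in Python. This is a conceptual sketch only, not a description of the hardware: `mul128` and `mul256` are hypothetical names, and the lanes hold ordinary Python floats simply to show how the eight single precision lanes are split between the two paths.

```python
def mul128(a, b):
    """Model of one 128-bit data path: four single precision lanes."""
    assert len(a) == len(b) == 4
    return [x * y for x, y in zip(a, b)]


def mul256(a, b):
    """Model of one 256-bit multiply uop split across two 128-bit paths."""
    assert len(a) == len(b) == 8
    low = mul128(a[:4], b[:4])   # low half: SIMD data path, per the text
    high = mul128(a[4:], b[4:])  # high half: FP data path, per the text
    return low + high            # list concatenation, low lanes first


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [2.0] * 8
product = mul256(a, b)
print(product)
```

The multiply never needs data to move between the two halves, which is why it can be ganged so cheaply; a lane-crossing shuffle does, hence the dedicated hardware on port 5.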
Figure 5 above shows the major uop types for each execution unit, but due to space constraints some are omitted from the diagram. At a high level, Sandy Bridge makes some improvements to the integer execution units, moves around some of the SIMD units and enhances the FP units. On the integer side, there is a new second port for LEA. For SIMD integer, port 0 and port 1 have effectively swapped places (compared to Nehalem/Westmere) and port 5 gained a shuffle. Note that in both designs, PSAD and string uops map to the same port as the integer SIMD multiply. There are also integer blends on ports 0 and 5. On the floating point side, the execution width doubled, the shuffle moved to port 5 and blends were added to ports 0 and 5. Both Nehalem and Sandy Bridge have FP moves on ports 0 and 5 as well. For those interested in the full details, the Sandy Bridge optimization manual should provide a comprehensive description when it arrives.
Sandy Bridge also improves performance for certain security primitives, such as the microcoded AES instructions that were added with Westmere, and for large number arithmetic. Sandy Bridge improves the performance of SHLD (shift left double), which is used for SHA-1 hashing. The throughput of ADC (add with carry), which is used for large number calculations such as RSA, doubled.
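To see why ADC matters for large number arithmetic, consider how a multi-word addition is built from 64-bit pieces. The Python sketch below is illustrative only: `add_with_carry` mimics what a single ADC instruction computes, and `bignum_add`, along with the little-endian list-of-limbs layout, is a hypothetical helper rather than code from any particular RSA library.

```python
MASK64 = (1 << 64) - 1


def add_with_carry(a, b, carry_in):
    """Mimic a 64-bit ADC: return (sum mod 2**64, carry_out)."""
    total = a + b + carry_in
    return total & MASK64, total >> 64


def bignum_add(x, y):
    """Add two equal-length little-endian lists of 64-bit limbs."""
    if len(x) != len(y):
        raise ValueError("operands must have the same number of limbs")
    result, carry = [], 0
    for a, b in zip(x, y):
        # Each limb depends on the carry out of the previous one.
        limb, carry = add_with_carry(a, b, carry)
        result.append(limb)
    return result, carry


limbs, carry_out = bignum_add([MASK64, 0], [1, 0])
print(limbs, carry_out)
```

Every limb needs one add-with-carry, and each one consumes the carry from the previous limb, so the speed of ADC directly bounds how fast such loops can run.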