Sandy Bridge already had a plethora of native data types. On the integer side, there were byte, word (16-bit) and double word (32-bit), as well as half-byte and 32-bit packed half-byte accesses (mostly for media). The floating point formats were much simpler, with 8-bit restricted, 32-bit single precision and 32-bit packed restricted data. Ivy Bridge adds a 64-bit double precision format to the mix, which is primarily useful for general purpose workloads rather than graphics.
The biggest improvement in performance for Ivy Bridge has been doubling the number of operations per cycle for each shader core. The previous generation Sandy Bridge was scoreboarded and could select two instructions per cycle for execution. The two instructions had to be co-issued from different threads, and each thread executed in-order. However, the opportunities for co-issuing were relatively limited because the second pipeline was specialized for math. Ivy Bridge enhances the second pipeline significantly to create more opportunities for co-issuing and attain higher IPC.
Figure 3. Shader Back-end Comparison
Sandy Bridge has a 4-wide vector pipeline that handles all basic instructions including multiply-add, multiply-accumulate and even some complex ones such as sum-of-absolute-differences for media threads. The execution unit processes 16-bit and 32-bit data. Smaller data types such as bytes or half-bytes are stored in a compressed format, but expanded to 16 bits for execution.
The second pipeline that was introduced with Sandy Bridge was a dedicated vector math unit for special functions. Math instructions include inverse, log, square root, reciprocal square root, exponential, power, sine, cosine and divide. Because the instructions in the second pipeline are fairly limited, Sandy Bridge was very rarely able to co-issue instructions. Intel’s architects estimated that co-issue occurred about 10% of the time overall.
For Ivy Bridge, the architects decided to capitalize on co-issue, but to make the mechanism far more robust by enhancing the execution units to create more opportunities for parallel execution. The second pipeline in Ivy Bridge has been augmented to handle the most common computational instructions. It now includes a 4-wide execution unit that performs single precision floating point multiply-accumulates (or variations thereof) and register moves. However, integer instructions are still issued down the first pipeline. The overall impact of this change is that the theoretical peak has doubled to 16 FLOPs per cycle for each shader core, and in practice co-issue is possible for roughly 60-70% of the available cycles.
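The peak figure can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope model: it assumes a multiply-accumulate counts as two FLOPs per lane, and the function names are hypothetical. The average issue rate further assumes that the first pipeline always has an instruction to issue, which is an optimistic simplification rather than anything Intel has stated.

```python
def peak_flops_per_cycle(lanes, mac_pipelines, flops_per_mac=2):
    """Peak single precision FLOPs per cycle for one shader core.

    Assumption: each lane of each MAC-capable pipeline retires one
    multiply-accumulate (2 FLOPs) per cycle.
    """
    return lanes * mac_pipelines * flops_per_mac


def average_issue_rate(co_issue_fraction):
    """Instructions issued per cycle, assuming the first pipeline is
    always busy and the second is used only on co-issue cycles."""
    return 1.0 + co_issue_fraction


# Sandy Bridge: only the first 4-wide pipeline performs multiply-accumulates.
SANDY_BRIDGE_PEAK = peak_flops_per_cycle(lanes=4, mac_pipelines=1)  # 8
# Ivy Bridge: both 4-wide pipelines perform multiply-accumulates.
IVY_BRIDGE_PEAK = peak_flops_per_cycle(lanes=4, mac_pipelines=2)    # 16

print("Sandy Bridge peak FLOPs/cycle:", SANDY_BRIDGE_PEAK)
print("Ivy Bridge peak FLOPs/cycle:", IVY_BRIDGE_PEAK)
print("Sandy Bridge issue rate at 10% co-issue:", average_issue_rate(0.10))
print("Ivy Bridge issue rate at 60-70% co-issue:",
      average_issue_rate(0.60), "to", average_issue_rate(0.70))
```

Under these assumptions, the co-issue rates quoted above translate to roughly 1.1 instructions per cycle on Sandy Bridge and 1.6 to 1.7 on Ivy Bridge.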
The first pipeline for Ivy Bridge has also been enhanced for double precision support. Intel’s implementation is quite robust and 64-bit floating point instructions are executed at half speed, just like SSE or AVX. Denormal numbers (e.g. underflow) are handled without any latency or throughput penalties. However, the second pipeline does not handle 64-bit data, as it would not be a particularly efficient use of area for a graphics-centric design. So the overall performance for double precision operations on Ivy Bridge is roughly a quarter of the single precision FLOP/s. While some compute-focused GPUs have an overall single to double precision ratio of 2:1, the vast majority of client designs have a ratio of 4:1 or worse, so Ivy Bridge is quite reasonable.
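The one-quarter figure follows from two factors: only one of the two pipelines handles 64-bit data, and it does so at half speed. A minimal sketch of that arithmetic, assuming a 4-wide pipeline and two FLOPs per multiply-accumulate (the constant names are hypothetical):

```python
from fractions import Fraction

LANES = 4            # width of each execution pipeline
FLOPS_PER_MAC = 2    # assumption: a multiply-accumulate counts as 2 FLOPs

# Single precision: both pipelines issue multiply-accumulates at full rate.
sp_peak = 2 * LANES * FLOPS_PER_MAC                   # FLOPs per cycle

# Double precision: first pipeline only, at half speed.
dp_peak = 1 * LANES * FLOPS_PER_MAC * Fraction(1, 2)  # FLOPs per cycle

dp_to_sp = dp_peak / sp_peak
print("Single precision peak FLOPs/cycle:", sp_peak)
print("Double precision peak FLOPs/cycle:", dp_peak)
print("Double to single precision ratio:", dp_to_sp)
```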
Conceptually, the right way to look at the Ivy Bridge execution pipelines is somewhat similar to the venerable Pentium. The first pipeline handles all regular execution, whether integer, single precision or double precision, while the second pipeline handles special instructions and single precision floating point. Of course, texturing and memory operations are still executed outside of the shader core, just as with the previous generation.
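To make the division of labor concrete, here is a toy model of the co-issue rule. It is only a sketch: the instruction class labels (`int`, `fp32`, `fp64`, `mov`, `math`) and all names are hypothetical, it assumes that register moves can also use the first pipeline, and it assumes Ivy Bridge keeps the Sandy Bridge requirement that the two instructions come from different threads.

```python
from dataclasses import dataclass

# Which instruction classes each pipeline accepts (illustrative labels only).
SNB_FIRST = frozenset({"int", "fp32", "mov"})
SNB_SECOND = frozenset({"math"})
IVB_FIRST = frozenset({"int", "fp32", "fp64", "mov"})
IVB_SECOND = frozenset({"math", "fp32", "mov"})


@dataclass(frozen=True)
class Instr:
    thread: int  # hardware thread the instruction belongs to
    kind: str    # one of the instruction class labels above


def can_co_issue(a, b, first_pipe, second_pipe):
    """True if a and b can issue in the same cycle, one per pipeline."""
    if a.thread == b.thread:
        return False  # co-issue requires two different threads
    return ((a.kind in first_pipe and b.kind in second_pipe) or
            (b.kind in first_pipe and a.kind in second_pipe))


mac0 = Instr(thread=0, kind="fp32")
mac1 = Instr(thread=1, kind="fp32")
add1 = Instr(thread=1, kind="int")
rsq1 = Instr(thread=1, kind="math")

print("SNB fp32 + fp32:", can_co_issue(mac0, mac1, SNB_FIRST, SNB_SECOND))
print("IVB fp32 + fp32:", can_co_issue(mac0, mac1, IVB_FIRST, IVB_SECOND))
print("IVB fp32 + int: ", can_co_issue(mac0, add1, IVB_FIRST, IVB_SECOND))
print("SNB fp32 + math:", can_co_issue(mac0, rsq1, SNB_FIRST, SNB_SECOND))
```

In this model, two single precision multiply-accumulates from different threads can pair up only on Ivy Bridge, which is the common case that accounts for the higher co-issue rate.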