With all the complex register access in the Gen architecture, scoreboarding is necessary to avoid destination hazards. If two instructions could write to the same destination, the scoreboard will stall the second instruction until the first has safely finished. Many of these stalls are not necessary, e.g. two instructions could write to different portions of a register, or one instruction might get masked off. The driver software can actually override the scoreboarding, although this may produce incorrect results.
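The stall behavior described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not a description of the actual hardware: the `Scoreboard` class, its per-register granularity and the latencies are all hypothetical.

```python
class Scoreboard:
    """Toy destination scoreboard tracking one pending write per register.

    Hypothetical model: granularity, latencies and names are assumptions.
    """

    def __init__(self):
        # register number -> cycle at which its outstanding write completes
        self.pending = {}

    def issue(self, dest_reg, now, latency):
        """Return the cycle at which the instruction is allowed to issue."""
        ready = self.pending.get(dest_reg, 0)
        start = max(now, ready)  # stall until the earlier write has finished
        self.pending[dest_reg] = start + latency
        return start


sb = Scoreboard()
first = sb.issue(dest_reg=10, now=0, latency=6)
second = sb.issue(dest_reg=10, now=1, latency=2)  # same destination: stalls
third = sb.issue(dest_reg=11, now=2, latency=2)   # other register: no stall
```

Because this model tracks whole registers, two writes to different halves of register 10 would still stall, which is the kind of unnecessary stall the text mentions.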
Sandy Bridge and Ironlake have a variety of data types. The integer data types are fairly straightforward, including byte, word (16-bit) and double word (32-bit), plus the slightly unusual half-byte and 32-bit packed half-byte. Floating point data is available in single precision, restricted (8-bit) and a 32-bit packed restricted form.
The GPU has two FP modes: IEEE and ‘alternate’. The IEEE mode is partially compliant with the relevant standards and has NaNs, infinities and denormals. However, there are some deviations, including rounding behavior and denormal handling. The alternate mode is graphics-specific and does not have NaNs, infinities or denormals. Extremely large (or small) numbers will saturate at the maximum (or minimum) that can be represented. To avoid any NaNs, special functions such as log, reciprocal square root and square root will take the absolute value of any input to guarantee good behavior. The advantage of alternate mode is higher performance and greater freedom in software optimization. For example, in alternate FP, multiplying by 0 always yields 0, which is not true for IEEE.
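A minimal sketch of the alternate-mode rules named above (saturation, absolute value on special function inputs, and multiply-by-zero) may help make the contrast with IEEE concrete. The function names are hypothetical, Python floats are double precision rather than single, and denormal handling is not modeled.

```python
import math

FLT_MAX = 3.4028234663852886e38  # largest finite single-precision value


def saturate(x):
    """Clamp a result to the finite single-precision range."""
    return max(-FLT_MAX, min(FLT_MAX, x))


def alt_mul(a, b):
    """Alternate-mode multiply (sketch): zero times anything is zero."""
    if a == 0.0 or b == 0.0:
        return 0.0
    return saturate(a * b)


def alt_sqrt(x):
    """Alternate-mode square root (sketch): absolute value of the input first."""
    return math.sqrt(abs(x))


def ieee_mul(a, b):
    """Plain IEEE multiply, for comparison."""
    return a * b


zero_times_inf_alt = alt_mul(0.0, math.inf)    # stays 0.0
zero_times_inf_ieee = ieee_mul(0.0, math.inf)  # NaN under IEEE rules
overflow_alt = alt_mul(FLT_MAX, 2.0)           # saturates at FLT_MAX
```

The IEEE case shows why a compiler cannot fold `x * 0` to `0` in general, while in this model of alternate mode it can.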
The execution units in each Gen 6 core have been substantially beefed up compared to the Ironlake generation. Both Sandy Bridge and Ironlake cores have a 128-bit wide vector execution unit that natively executes eight 16-bit or four 32-bit operations per clock cycle. While data can be stored in 4-bit or 8-bit formats for compression, it is expanded to 16 bits for actual execution and has similar throughput. The shader cores also execute media operations such as sum-of-absolute-differences in the vector pipeline.
Figure 5 – Shader Back-end Comparison
The older Ironlake core does not have any multiply-add instructions and very limited multiply-accumulate support, so the peak throughput is 4 SP FLOP/cycle. The Gen 6 vector execution unit has both multiply-add and multiply-accumulate. The latter implicitly uses a high precision accumulator in the ARF for each channel of execution. The peak throughput for the Gen 6 GPU is 129.6 GFLOP/s (using the turbo frequency of 1.35GHz), compared to 43.2 GFLOP/s for Ironlake (turbo frequency of 0.9GHz). The Gen 6 core added a couple of new instructions, plane equation and linear interpolation, which are fairly common in graphics and were previously synthesized in software rather than directly executed.
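The peak throughput figures can be reproduced with simple arithmetic. The sketch below assumes 12 cores for both GPUs; that count is not stated in this section, but it is the value that reproduces the quoted numbers. It also counts a multiply-add as 2 FLOP per lane. The function and parameter names are illustrative.

```python
def peak_gflops(cores, lanes, flop_per_lane, ghz):
    """Peak single-precision GFLOP/s = cores x lanes x FLOP per lane x GHz."""
    return cores * lanes * flop_per_lane * ghz


# Ironlake: four 32-bit lanes, no multiply-add, so 1 FLOP per lane per cycle.
ironlake = peak_gflops(cores=12, lanes=4, flop_per_lane=1, ghz=0.9)

# Gen 6: four 32-bit lanes, multiply-add counts as 2 FLOP per lane per cycle.
sandy_bridge = peak_gflops(cores=12, lanes=4, flop_per_lane=2, ghz=1.35)

print(f"Ironlake:     {ironlake:.1f} GFLOP/s")
print(f"Sandy Bridge: {sandy_bridge:.1f} GFLOP/s")
```

Under these assumptions the threefold gain splits into a factor of 2 from multiply-add and a factor of 1.5 from the higher turbo frequency.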
The Gen instructions operate on variable-length vectors that are typically longer than the hardware’s execution resources. The longest uncompressed instruction is 8x32b operations (for the SIMD1x8 mode), which takes two cycles to execute. A compressed instruction can take 4 cycles to execute on 16 data items. This multi-cycle execution is similar to the behavior of an AMD wavefront, with the added twist that the instruction latency is non-uniform, which complicates scheduling slightly.
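The cycle counts above follow from dividing the total operand width by the 128-bit vector unit. A rough sketch, assuming execution time is simply the number of 128-bit passes (the helper name is hypothetical, and real latencies are non-uniform as noted):

```python
VECTOR_WIDTH_BITS = 128  # width of the vector execution unit, from the text


def issue_cycles(channels, bits_per_channel=32):
    """Number of 128-bit passes needed to cover all channels (ceiling division)."""
    total_bits = channels * bits_per_channel
    return -(-total_bits // VECTOR_WIDTH_BITS)


uncompressed = issue_cycles(8)    # 8 x 32-bit channels
compressed = issue_cycles(16)     # 16 x 32-bit channels
half_width = issue_cycles(8, 16)  # 8 x 16-bit channels fit in one pass
```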
Just as important as the improvements in the vector unit, Sandy Bridge has dramatically improved performance for special math functions such as transcendentals. Previously, Ironlake shared a single 32-bit math unit between an entire row (3 cores). The math instructions included inverse, log, square root, reciprocal square root, exponentiation, power, sine, cosine and integer divide. These instructions were sent through the messaging framework to the math unit and most took 22-88 cycles per data element. The more complicated trigonometric instructions typically took 132 clocks, but could be as high as 264 cycles for each data element.
The Gen 6 core has a 32-bit dedicated math unit with a new floating point divide instruction. The math unit is also faster for some transcendental instructions than the previous generation, particularly the trigonometric ones. Threads can issue one instruction to the math unit or the vector unit; however, the latency of most math instructions is fairly long, so the execution is still mostly simultaneous.
Sandy Bridge has a much more balanced ratio of arithmetic and transcendental execution units. The most common use of special functions is applying some scale factor to an entire vertex or pixel (e.g. normalizing or rotating an object). AMD GPUs have a roughly 4:1 ratio, while Nvidia’s vary between 4:1 and 8:1. In contrast, Ironlake’s 12:1 ratio leaves too little transcendental throughput for good performance, while Sandy Bridge should be just right.
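The ratios follow from the unit counts given earlier: four 32-bit vector lanes per core, with one math unit shared by three cores on Ironlake and one dedicated math unit per core on Gen 6. The Sandy Bridge figure below is derived from those counts rather than quoted in the text, and the helper name is hypothetical.

```python
def alu_to_math_ratio(lanes_per_core, cores_per_math_unit):
    """Vector lanes served by each math unit."""
    return lanes_per_core * cores_per_math_unit


ironlake_ratio = alu_to_math_ratio(lanes_per_core=4, cores_per_math_unit=3)
sandy_bridge_ratio = alu_to_math_ratio(lanes_per_core=4, cores_per_math_unit=1)

print(f"Ironlake {ironlake_ratio}:1, Sandy Bridge {sandy_bridge_ratio}:1")
```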