The heart of Nvidia’s next generation hardware is really the execution resources within each core; not only have these been increased, but they have also been substantially refined. The sheer number of ALUs and FPUs per core has quadrupled over the current generation GT200, but they are now split between two independent pipelines of 16 units each. As a result, the latency for a single warp is halved to two cycles, and throughput is doubled when two warps issue to the different pipelines. Incidentally, this highlights one of the reasons to use a microarchitectural concept like the warp: it enables expanding the core execution resources without changing the software-visible model. Figure 4 below shows the cores for GT200 and Fermi.
Figure 4 – Execution Units for Fermi and GT200 Cores
Each execution unit (which Nvidia calls a “CUDA core”) has a dedicated integer (ALU) and a floating point (FPU) data path. The two data paths share an issue port and cannot be issued simultaneously.
The ALUs have been upgraded with new operations and higher precision. Integer multiplies (and multiply-accumulates) are now natively 32-bit, rather than 24-bit as in GT200, although they execute at half speed: each pipeline can execute 8 integer multiplies per cycle. Additionally, the bit-wise operations required for DX11 are supported, including bit counts, inserts, extracts and many others. Higher precision support is also a priority for Fermi. Simple 64-bit ALU operations, such as addition and subtraction, can be done by cracking the operation into 32-bit low and high halves and using the ALUs at half throughput. A 64-bit integer multiply (or multiply-accumulate) requires four operations, so each pipeline can execute 2 such multiplies per cycle and the whole core 4 per cycle.
The floating point microarchitecture has been wholly revamped over the GT200. The GT200 featured a single dedicated double precision (DP) FPU for each core (versus 8 single precision FPUs), resulting in unimpressive performance. This approach was undoubtedly driven by extreme time-to-market pressure, since there are clearly more efficient approaches, albeit with substantially more design complexity.
For Fermi, Nvidia was clearly aiming to maximize double precision FP performance and embraced the additional required complexity. Each core can execute a DP fused multiply-add (FMA) warp in two cycles by using all the FPUs across both pipelines. Significantly, this is the only warp instruction that requires both issue pipelines, which suggests certain implementation details. While Nvidia’s approach precludes issuing a second warp alongside a DP FMA, the SP:DP throughput ratio improves from 8:1 in GT200 (or 12:1, counting the apocryphal dual issue) to 2:1 in Fermi. This is on par with Intel’s and AMD’s implementations of SSE and ahead of the 4:1 ratio for AMD’s graphics processors. Academic research suggests that the area and delay penalties for this style of multiple-precision FPU are approximately 20% and 10% over a single DP FPU, but it is likely that Nvidia’s overhead is somewhat lower. The end result should be an order-of-magnitude increase in double precision performance at the same frequency: quite a leap forward for a single generation.
The single precision (SP) floating point performance increased along with the number of vector lanes, but it also became more refined and useful from a numerical perspective. In GT200, SP was almost IEEE 754 compliant, but lacked certain rounding modes and denormal handling. This is not a problem for graphics, but for real numerical applications it certainly left something to be desired. Nvidia rectified these problems, and also replaced the previous 32-bit multiply-add instruction with a 32-bit fused multiply-add (FMA), bringing SP into line with DP behavior (which has had FMA since GT200). The distinction here is subtle and has to do with intermediate rounding: in a regular multiply-add, the result of the multiply is rounded to 32-bit precision before the addition is done. With an FMA, there is no intermediate rounding, so the internal precision is higher, which is advantageous for division and square root emulation.
The last change to the execution resources lies in the Special Function Units (SFUs). Since the number of ALUs and FPUs per pipeline has doubled, thereby halving the latency of those warps, Nvidia also doubled the number of SFUs, to 4 per core, so that the latency of an SFU warp decreased in tandem. However, to save area and power, the SFUs are shared between the two execution pipelines: an eminently reasonable design choice given their infrequent use. As with the prior generation, SFU execution can overlap with ALU or FPU execution, thanks to the operand collectors and result queues.
Interestingly, the apocryphal ‘extra mul’, which only seemed to rear its head in synthetic tests, is now gone: an artifact of Nvidia’s marketing history. Thankfully, it has been replaced by two real pipelines that can be used in real applications.