Of course, the hallmark of modern GPUs is the incredible density of execution resources and aggregate computational power. NVIDIA achieves this performance through SIMT, a modified variant of SIMD. As described earlier, SIMT delivers the performance of a SIMD processor, but hides the architectural complexity and degrades more gracefully when conditional branches cause warp divergence.
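The divergence penalty can be captured with a toy cost model. This is purely illustrative (the function and cycle counts below are assumptions for the sketch, not NVIDIA's actual scheduler): when the 32 threads of a warp disagree at a branch, the hardware serializes the taken paths, so each extra path costs another full issue slot.

```python
# Toy model of SIMT warp divergence (illustrative, not NVIDIA's scheduler).
WARP_SIZE = 32

def divergence_cost(taken_mask, cycles_per_path=4):
    """Cycles for a warp to execute one branched region.

    taken_mask: 32 booleans, True if that thread takes the branch.
    If threads split both ways, both paths execute back to back.
    """
    paths = len({True, False} & set(taken_mask[:WARP_SIZE]))
    return paths * cycles_per_path

# Uniform branch: every thread agrees, so only one path runs.
uniform = divergence_cost([True] * WARP_SIZE)                   # 4 cycles
# Divergent branch: both sides of the branch must run.
divergent = divergence_cost([i % 2 == 0 for i in range(WARP_SIZE)])  # 8 cycles
```

The point of SIMT's "smoother degradation" is that the worst case is bounded by the number of distinct paths, not by anything the programmer must manage explicitly.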
Figure 5 – Shader Multiprocessor Architecture
As Figure 5 above shows, a single warp instruction is issued each cycle and feeds into a high speed functional unit cluster that runs at twice the frequency of the fetch and issue logic, register files and shared memory.
The primary execution resources are a set of eight 32-bit ALU and multiply-add (MAD) units, which execute mostly IEEE-compliant single-precision floating point and 32-bit integer ALU warp instructions in 4 cycles (4 fast clock cycles, as opposed to the slower ‘core’ clock of the control logic and storage arrays). Each fast cycle, 8 threads from the warp read their operands, perform the operation and then write back the result.
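The 4-cycle figure falls directly out of the warp and lane widths described above, as a quick back-of-the-envelope check:

```python
# Why a warp instruction occupies the SP cluster for 4 fast cycles:
# 32 threads stream through 8 SP lanes, 8 threads per fast clock.
WARP_SIZE = 32
SP_LANES = 8

cycles_per_warp = WARP_SIZE // SP_LANES   # 32 / 8 = 4 fast cycles
# Counting a MAD as two flops (multiply + add), the SP cluster
# sustains 16 single-precision flops per fast clock.
mad_flops_per_fast_clock = SP_LANES * 2
```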
Control flow instructions (e.g. CMP) are executed by the branch unit. As previously noted, since GPUs do not speculate, a warp that encounters a branch will stall until it can be resolved. Like the arithmetic operations, a branch warp takes 4 cycles to execute.
In addition to these standard functional units, each SM includes two execution units for less frequent operations. The first is a brand new dedicated 64-bit fused multiply-add (FMA) unit, shared by the entire SM for integer and floating point computation, which relies on the new support for 64-bit operands in the register file. The double precision FMA unit supports standard IEEE 754R behavior for double precision operands, with full speed denormal handling and classic 64-bit integer arithmetic. It performs a fused multiply-add with a single rounding on the result, enabling accurate iterative convergence algorithms. Since there is only one double precision FPU, it should come as no surprise that double precision performance is 8-12 times lower than 32-bit arithmetic (depending on how you count). However, NVIDIA is fully aware of the growing importance of double precision arithmetic and can be expected to increase that performance in the next generation.
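What the "single rounding" of an FMA buys is easy to demonstrate with a small reference model. The sketch below is not NVIDIA's hardware, just a Python model of the IEEE semantics: since Python floats are exact dyadic rationals, `Fraction` computes a*b + c exactly, and the final `float()` conversion performs the one correctly-rounded step, mimicking a hardware FMA.

```python
from fractions import Fraction

def fma_ref(a, b, c):
    """Reference fused multiply-add: a*b + c with a single rounding."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a, b, c = 1.0 + 2.0**-27, 1.0 - 2.0**-27, -1.0
separate = a * b + c      # two roundings: a*b rounds up to exactly 1.0, so the sum is 0.0
fused = fma_ref(a, b, c)  # one rounding preserves the residual: -2**-54
```

Keeping that residual is precisely why the text notes FMA enables accurate iterative convergence algorithms: Newton-Raphson style refinements depend on computing the small error term a*b + c without it being rounded away.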
Last is the Special Function Unit (SFU), a cluster of several units that handles the remainder of the unusual and exceptional instructions in the GPU. The SFU is responsible for executing transcendental functions, interpolation for parameter blending, reciprocal, reciprocal square root, sine, cosine, and a bevy of other unusual operations. Instructions which are natively executed by the SFU have a 16 cycle latency, while more complicated functions such as square root or exponentiation are synthesized by the compiler from a combination of instructions and take 32 cycles or longer to execute. The SFU hardware for interpolation includes several 32-bit floating point multiply units, which can be issued separately for multiply instructions instead of an interpolation. The SFU is physically implemented as two execution units; each one services four of the eight execution pipelines in the SM, and multiply instructions issued to the SFU execute in the same 4 cycles as in the FMAD unit.
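To make "synthesized by the compiler" concrete, here is one plausible decomposition. The actual instruction sequence NVIDIA's compiler emits is not given in the text, so this is an assumption for illustration: general exponentiation can be built from two native SFU-class primitives (log2 and exp2) plus a multiply, which is why the combined latency is roughly double that of a single native SFU instruction.

```python
import math

def synthesized_pow(x, y):
    """Illustrative decomposition: pow(x, y) = exp2(y * log2(x)).

    Two transcendental primitives plus one multiply, standing in for
    how a compiler can build pow from native SFU operations.
    """
    return 2.0 ** (y * math.log2(x))

# Agrees with the exact answer up to rounding of the intermediate steps.
error = abs(synthesized_pow(3.0, 4.0) - 81.0)
```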
While those are all of the functional units in the SM, the processor obviously must also be able to load and store data to and from memory. Loads and stores are issued to the SM Controller and handled in the texture pipeline, which will be discussed later.
One of the interesting complexities of NVIDIA’s microarchitecture is the relationship between latency and throughput. In general, CPUs execute most operations in a single cycle, but the latency of the fastest operation for an SP core is 4 cycles. Since the SM can issue one warp instruction every 2 ‘fast’ cycles, it should be possible to have multiple instructions in flight at once. In fact, this ability is what NVIDIA refers to as ‘dual issue’, although in reality it is simply parallel execution across functional units. The SP cores execute one instruction over 4 cycles, while other execution units can simultaneously process a different warp instruction.
Figure 6 – Dual ‘Issue’ for NVIDIA’s Execution Units
As Figure 6 illustrates, the SM can issue a warp instruction every 2 fast clocks. In the first cycle, a MAD is issued to the FPU. Two cycles later, a MUL instruction is issued to the SFU. Two cycles after that, the FPU is free again and can execute another MAD. Two cycles after that, the SFU is free and can begin to execute a long running transcendental instruction. Using this technique, the computational throughput of the shader core is increased by 50%, while retaining the simplicity of issuing only one warp every 2 cycles, which simplifies the scoreboarding logic. Not all combinations can be executed in parallel; for instance, the double precision unit shares logic with the single precision units, so the two cannot be active simultaneously.
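The 50% figure follows from a simple flop count under the steady-state schedule above, alternating a MAD warp on the FPU with a MUL warp on the SFU every two fast clocks:

```python
# Back-of-the-envelope check of the 50% dual 'issue' gain.
LANES = 8  # SP lanes, matched by the SFU multiply capability

mad_only_flops = LANES * 2                 # 8 MADs = 16 flops per fast clock
dual_issue_flops = LANES * 2 + LANES * 1   # 8 MADs + 8 MULs = 24 flops per fast clock
speedup = dual_issue_flops / mad_only_flops - 1.0   # 0.5, i.e. +50%
```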
Intense Register Pressure
The register files for the 8 SPs and the multi-banked shared memory are responsible for feeding the incredibly dense and double clocked computational core of the SM, a feat in and of itself. In a single fast clock cycle, the execution units can perform up to 8 FMADs and 8 FMULs. Each FMAD requires 3 input operands and an output, while each FMUL requires 2 inputs and an output. So in a single fast clock cycle, a total of 40 inputs and 16 outputs are needed. Of course, the register files and shared memory do not run on the fast clock, but on the core clock. So between the register file and the shared memory, there must be a total of 80 inputs and 32 outputs available – an aggregate of 112 ports across all the storage arrays. Of course, a couple of other units use the register files as well – prominently the texture unit and load/store pipelines, so the total is probably closer to 128.
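The port arithmetic above can be reproduced step by step:

```python
# Operand-port arithmetic for the SM's storage arrays.
FMADS, FMULS = 8, 8                  # peak operations per fast clock

inputs_fast = FMADS * 3 + FMULS * 2  # 24 + 16 = 40 operand reads per fast clock
outputs_fast = FMADS + FMULS         # 16 result writes per fast clock

CLOCK_RATIO = 2                      # fast clock runs at 2x the core clock
inputs_core = inputs_fast * CLOCK_RATIO    # 80 reads per core clock
outputs_core = outputs_fast * CLOCK_RATIO  # 32 writes per core clock
total_ports = inputs_core + outputs_core   # 112 across register files + shared memory
```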