The control hierarchy is similar to the GT200, with a global scheduler that issues work to each SM. Previously the global scheduler (and hence the GPU) could only have a single kernel in flight. Nvidia’s newer scheduler can maintain state for up to 16 different kernels, one per SM. Each SM runs a single kernel, but the ability to keep multiple kernels in flight increases utilization, especially as a kernel nears completion and has fewer thread blocks remaining. More importantly, assigning a kernel per core means that smaller kernels can be dispatched to the GPU efficiently.
The latency for a context switch between kernels has also been reduced by 10X, to around 25 microseconds. This delay is largely due to cleaning up the state that each kernel must track – such as TLBs, dirty data in caches, registers, shared memory and the rest of the kernel context.
Figure 2 – Fermi and GT200 Overview
As shown above in Figure 2, Fermi is a big change from GT200, especially in the memory hierarchy. The cores in GT200 are not self-contained: every group of three cores shares a single memory pipeline, and this grouping is called a Thread Processing Cluster (TPC). Fermi does away with this arrangement and gives each core its own load/store unit (LSU) and L1 data cache, albeit one shared between the two execution pipelines.
Fermi SM Overview
The cores (or SMs) in Fermi have been tremendously beefed up and resources have been shifted around substantially. At a high level, the execution resources have quadrupled, but are shared between two scalar execution pipelines; each pipeline has twice the execution resources (or vector lanes) of the GT200 cores. It’s essential to note that while the two pipelines can execute two warps from the same thread block, they are not superscalar in the sense of a CPU. The memory pipeline has also been brought into the core, whereas previously each memory pipeline was shared between three cores. More importantly, the shared memory has been folded into a (semi-coherent) L1 data cache, giving each core a real memory hierarchy.
In many respects, these changes are conceptually reminiscent of the improvements between Niagara I and II. Niagara II doubled the thread count to 8, but each set of 4 threads had a dedicated scheduler and integer (ALU) pipeline, compared to dedicated ALUs and floating point (FPU) pipelines for Fermi. All 8 threads in a Niagara II core shared memory pipelines, just like Fermi, and FPUs, which are analogous to the special function units (SFU).
To utilize those execution resources, the number of threads in flight for each Fermi core has increased by 50% to 1536, spread across 8 concurrent thread blocks. This means that fully utilizing one of the new cores requires 192 threads per block, up from 128 in GT200. As with the current generation, execution within an SM occurs at the granularity of a warp, which is a set of 32 threads. With the increase in threads, each core can have up to 48 warps in flight at once.
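The figures above all follow from simple division; a minimal sketch of the arithmetic, using only the per-SM limits quoted in this section:

```python
# Back-of-the-envelope occupancy arithmetic for a Fermi SM,
# using the limits quoted above (1536 threads, 8 blocks, 32-thread warps).
WARP_SIZE = 32
MAX_THREADS_PER_SM = 1536   # up from 1024 on GT200
MAX_BLOCKS_PER_SM = 8       # same block limit as GT200

# Threads per block needed to fill the SM when running at the block limit.
threads_per_block = MAX_THREADS_PER_SM // MAX_BLOCKS_PER_SM
print(threads_per_block)    # 192 (GT200: 1024 / 8 = 128)

# Maximum warps in flight per SM.
max_warps = MAX_THREADS_PER_SM // WARP_SIZE
print(max_warps)            # 48
```

The same arithmetic explains the GT200 figure of 128 threads per block: 1024 threads divided across the same 8-block limit.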
As with all Nvidia DX10 hardware, Fermi has several clock domains in each core – principally the regular clock for the front-end and scheduling, and the fast clock for the execution units, which runs at twice the regular clock.