The processor arrays form the heart of the GT200’s impressive computational power. The processors are organized hierarchically, and the Streaming Multiprocessor (SM – NVIDIA’s terminology for each multiprocessor) is the lowest independent level of a two-level hierarchy. A Streaming Multiprocessor is shown in detail in Figure 5.
Figure 5 – GT200 Streaming Multiprocessor Architecture
The Streaming Multiprocessor is in reality a highly threaded, single-issue processor with SIMD, although this is obscured by the overall complexity and marketing of the whole architecture. Like a modern CPU core, an SM is the smallest hardware grouping with an independent front end (particularly the fetch unit and scheduling logic), so some of the control logic overhead is amortized over 8 functional units. The rest of the control logic is per thread and hence unshared.
Each SM can concurrently execute up to 8 thread blocks and a total of 1024 threads. While threads and thread blocks are architecturally visible in CUDA, neither is appropriate for efficiently managing threads within the SM hardware: thread blocks are far too coarse grained (up to 512 threads), and individual threads are too fine grained.
Each SM is issued up to 8 thread blocks from the work distribution unit, while each SM thread scheduler manages its threads and thread blocks using an intermediate microarchitectural grouping called a warp. It is essential to note that warps are, for the most part, not architecturally visible (there is one exception, which is detailed later); they are a microarchitectural feature that can change from implementation to implementation while retaining complete compatibility, which is one of the key differences between a warp and a vector. That being said, warp size clearly has an impact on performance, as will be seen later. Currently a warp is defined as 32 threads for CUDA; warps are defined differently for various graphics functions, primarily based on the latency of the operations involved and the relative latency to memory. The current generation SM can have as many as 32 warps (1024 threads) in flight, while the previous generation G80 was limited to 24 warps, or 768 threads; the number of thread blocks is the same for both generations.
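The limits above interact, and a little arithmetic makes that concrete. The following is a minimal sketch in Python, not a model of the real scheduler: it considers only the warp and thread block limits quoted in the text, and it assumes that register and shared memory usage (which can also limit residency on real hardware) are not the bottleneck. The function and table names are made up for illustration.

```python
import math

WARP_SIZE = 32  # threads per warp for CUDA, as stated in the text

# Per-SM limits quoted in the text: (max warps in flight, max thread blocks).
SM_LIMITS = {"G80": (24, 8), "GT200": (32, 8)}


def resident_blocks(threads_per_block, generation="GT200"):
    """Thread blocks one SM can hold, using only the warp and block limits.

    Assumption: registers and shared memory are ignored in this sketch.
    """
    max_warps, max_blocks = SM_LIMITS[generation]
    # A block occupies whole warps, so a partial warp still costs a full one.
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    return min(max_blocks, max_warps // warps_per_block)


def resident_threads(threads_per_block, generation="GT200"):
    """Threads in flight on one SM for a given thread block size."""
    return resident_blocks(threads_per_block, generation) * threads_per_block


print(resident_threads(512, "GT200"))  # 2 blocks of 16 warps each
print(resident_threads(64, "GT200"))   # capped at 8 blocks of 2 warps each
print(resident_threads(512, "G80"))    # only 1 block of 16 warps fits in 24
```

Under these assumptions, small thread blocks run into the 8 block limit before the warp limit, which is why a block size of 64 leaves half of the GT200's 1024 thread slots unused.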
The SIMT Programming Model
Unlike a modern microprocessor, NVIDIA’s SM eschews speculative execution: there are no branch predictors, nor any mechanisms to roll back incorrect results. Much like the Niagara architecture, it is assumed that there is always non-speculative work available to execute, and hence no need to speculate and consume extra power. As a result, a warp that executes a branch waits until the target address for every thread in the warp has been calculated and the selected next instruction has been fetched from the instruction cache, at which point the warp can continue.
NVIDIA describes their execution model as Single Instruction, Multiple Thread (SIMT), a variant of SIMD. From a programming perspective, the big difference is that vector width is architecturally visible for SIMD, and data must be packed into and unpacked from vectors for computation. In the SIMT model, execution width is a microarchitectural feature handled solely by hardware, and a SIMT instruction such as a conditional branch specifies the behavior of a single independent thread. When a warp diverges (i.e. threads within a warp are actually executing from different instruction pointers), performance degrades gracefully. If there are N divergent paths in a warp, performance decreases by about a factor of N, depending on the length of each path. An N-way divergent warp is serially issued over the N different paths, using a hardware stack and per-thread predication logic to write back only the threads taking each divergent path.
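The cost of divergence can be illustrated with a toy model. The sketch below, in Python, is an assumption-laden simplification rather than a description of the actual hardware: it groups the threads of one warp by the branch target each has chosen, issues every distinct path once, and records which threads would be allowed to write back on each path. It does not model the hardware stack or reconvergence, and the names `issue_divergent`, `targets` and `path_lengths` are hypothetical.

```python
WARP_SIZE = 32


def issue_divergent(targets, path_lengths):
    """Serially issue each distinct path taken by the threads of one warp.

    targets[i] is the (hypothetical) branch target chosen by thread i, and
    path_lengths maps each target to its instruction count.
    Returns (issue_slots, masks), where masks[target] lists the threads
    that may write back while that path is being issued.
    """
    masks = {}
    for thread, target in enumerate(targets):
        masks.setdefault(target, []).append(thread)
    # Every distinct path is issued once for the whole warp, one after another.
    issue_slots = sum(path_lengths[target] for target in masks)
    return issue_slots, masks


lengths = {"then": 10, "else": 10}
uniform = ["then"] * WARP_SIZE
split = ["then" if t % 2 == 0 else "else" for t in range(WARP_SIZE)]

slots_uniform, _ = issue_divergent(uniform, lengths)
slots_split, masks = issue_divergent(split, lengths)
print(slots_uniform, slots_split)  # prints: 10 20
```

With two equally long paths, the divergent warp needs twice the issue slots of the uniform one, matching the factor of N described above. If the paths differ in length, the slowdown is the sum of the path lengths rather than a clean multiple.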
The other major distinction is that the elements of a vector can freely communicate with each other, since they reside in the same address space. In SIMT, different threads within a warp can only communicate through shared memory; their registers are strictly private.
In an interesting twist, some of NVIDIA’s newest additions to CUDA actually violate the SIMT architecture and philosophy. As mentioned previously in this article, the Compute 1.2 specification includes warp voting functions, particularly the __any() and __all() functions. The __any() function takes a predicate as an argument and returns true if the input predicate is true for any of the 32 threads in a warp. The __all() function works similarly but requires that the predicate evaluate true for every thread in the warp. What is interesting about these two functions is that warps are now an architecturally visible feature of Compute 1.2 devices. This is perhaps one step down the path towards explicit vectors and the classic SIMD model.
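The semantics of the two voting functions are simple enough to model in a few lines. The following Python sketch is only an illustration of the behavior described above, not NVIDIA's implementation: it assumes all 32 threads of the warp participate, takes their predicates as a list, and returns the value each thread would see. The names `warp_any` and `warp_all` are stand-ins for `__any()` and `__all()`.

```python
WARP_SIZE = 32


def warp_any(predicates):
    """Model of __any(): every thread sees True if any predicate is true."""
    assert len(predicates) == WARP_SIZE
    result = any(predicates)
    return [result] * WARP_SIZE


def warp_all(predicates):
    """Model of __all(): every thread sees True only if all predicates are."""
    assert len(predicates) == WARP_SIZE
    result = all(predicates)
    return [result] * WARP_SIZE


# Hypothetical case: only the thread in lane 7 has a true predicate.
one_hit = [lane == 7 for lane in range(WARP_SIZE)]
print(warp_any(one_hit)[0], warp_all(one_hit)[0])  # prints: True False
```

The point of the model is that the result seen by one thread depends on the predicates of exactly 32 threads, so a program that uses these functions is written against the warp size rather than against a single independent thread.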