For this section, which contains a detailed look at the SM, we have reproduced Figure 5 below.
Figure 5 – Shader Multiprocessor Architecture
The front-end of the SM has a conventional set-associative instruction cache that automatically fills instruction cache lines when a branch target address misses. NVIDIA has not yet released any details about where this structure resides or how it is organized. The cache management policy, capacity, access latency and almost all other details are unknown.
Warp instructions are fetched into a multithreaded instruction buffer, which probably contains 32 or 64 entries – one to two entries per warp in-flight.
The instruction issue logic is responsible for selecting a warp instruction to issue each cycle. The instruction buffer sits close to the SM's issue logic; alternatively, the issue logic may simply be implemented as additional metadata attached to each warp instruction in the buffer.
Instructions are scoreboarded to prevent various hazards from stalling execution. When all the operands for an instruction and the destination registers/shared memory are available, the instruction status changes to ‘ready’. Each cycle the issue logic selects and forwards the highest priority ‘ready to execute’ warp instruction from the buffer. Prioritization is determined with a round-robin algorithm between the 32 warps that also accounts for warp type, instruction type and other factors.
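To make the selection step concrete, here is a minimal Python sketch of scoreboarded, round-robin issue. It is a simplified model under stated assumptions, not a description of NVIDIA's logic: the buffer holds one instruction per warp, the names `WarpInstr`, `Scoreboard` and `select_warp` are hypothetical, and the warp type and instruction type factors mentioned above are ignored.

```python
from dataclasses import dataclass, field

NUM_WARPS = 32  # warps per SM, per the text


@dataclass
class WarpInstr:
    """One buffered warp instruction (hypothetical representation)."""
    warp_id: int
    srcs: tuple  # source register names
    dst: str     # destination register name


@dataclass
class Scoreboard:
    # Per-warp set of registers that still have a write in flight.
    pending: dict = field(default_factory=dict)

    def is_ready(self, ins):
        busy = self.pending.get(ins.warp_id, set())
        # Ready only if neither the sources nor the destination await a write.
        return ins.dst not in busy and not any(s in busy for s in ins.srcs)

    def mark_issued(self, ins):
        self.pending.setdefault(ins.warp_id, set()).add(ins.dst)

    def mark_written(self, warp_id, reg):
        self.pending.get(warp_id, set()).discard(reg)


def select_warp(buffer, scoreboard, last_issued):
    """Plain round-robin: scan warps starting just after the last one issued.

    `buffer` maps warp_id -> WarpInstr (assumption: one entry per warp).
    Returns the first ready instruction found, or None if nothing is ready.
    """
    for offset in range(1, NUM_WARPS + 1):
        warp_id = (last_issued + offset) % NUM_WARPS
        ins = buffer.get(warp_id)
        if ins is not None and scoreboard.is_ready(ins):
            return ins
    return None
```

In this model a warp whose source register is still waiting on a load is simply skipped, and the scan wraps around to the next ready warp, which is the latency-hiding behavior the scoreboard is meant to provide.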
A warp that has multiple ready instructions can continue to issue until the scoreboard blocks further progress or another warp is selected for issue. This means that scoreboarding enables a very simple form of out-of-order completion. For example, a warp could issue a long-latency memory instruction followed by a computational instruction, and the computation would write back its results before the memory instruction does. This is a very limited form of out-of-order execution, comparable to techniques used in Itanium and much less aggressive (and more power efficient) than a fully renamed, out-of-order issue processor such as the Core 2.
The register files are primarily designed with high bandwidth in mind. The GT200 register files are 64KB, compared to 32KB in the earlier G80 generation. The SM has a total of 16K register file entries partitioned across the SPs. Each SP has a 2K-entry register file that can be used by up to 128 threads and is probably organized into 16 or 24 banks. Register file entries are 32 bits wide, and the new double precision data types (both integer and floating point) consume two adjacent registers. The register file is dynamically partitioned between thread blocks by the JIT/driver, and within the allocation for each thread block, the registers are statically assigned to a given thread. An individual thread can have between 4 and 128 registers.
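The register file figures above can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope model: the count of 8 SPs per SM is inferred from 16K entries divided by 2K entries per SP, the function name `max_resident_threads` is hypothetical, and any allocation granularity the driver may apply is ignored.

```python
ENTRY_BYTES = 4             # 32-bit register file entries
ENTRIES_PER_SP = 2 * 1024   # 2K entries per SP
SPS_PER_SM = 8              # inferred: 16K entries / 2K entries per SP
MAX_THREADS_PER_SP = 128    # per the text

entries_per_sm = SPS_PER_SM * ENTRIES_PER_SP
register_file_kb = entries_per_sm * ENTRY_BYTES // 1024


def max_resident_threads(regs_per_thread):
    """Upper bound on threads per SM for a given per-thread register count."""
    if not 4 <= regs_per_thread <= 128:
        raise ValueError("the text gives 4 to 128 registers per thread")
    limited_by_registers = entries_per_sm // regs_per_thread
    limited_by_threads = SPS_PER_SM * MAX_THREADS_PER_SP
    return min(limited_by_registers, limited_by_threads)
```

Under these assumptions, a kernel using 16 registers per thread can still fill all 1024 thread slots, while one using 32 registers per thread is limited to 512 resident threads.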
Each SM in the GT200 has a 16KB shared memory that is used for communication between threads in the same block. To let the threads of a warp access it in parallel, the 4096 entries are organized into 16 banks, each 32 bits wide. Shared memory is extremely fast: unless there is a bank conflict, its latency is the same as that of the register file. The shared memory is dynamically partitioned between different thread blocks. Within a thread block, any thread can access the shared memory. The shared memory is an essential element of the CUDA programming model, as it is the only channel for low latency data communication between threads.
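A short sketch helps show when bank conflicts arise. It assumes that successive 32-bit words map to successive banks and that a conflict occurs when simultaneous accesses touch different words in the same bank; the helper names `bank_of` and `conflict_degree` are hypothetical, and hardware details such as broadcast rules are deliberately left out.

```python
from collections import defaultdict

NUM_BANKS = 16
BANK_WIDTH_BYTES = 4       # 32-bit bank width
SHARED_BYTES = 16 * 1024   # 16KB per SM, i.e. 4096 32-bit entries


def bank_of(byte_addr):
    """Bank index, assuming consecutive 32-bit words go to consecutive banks."""
    return (byte_addr // BANK_WIDTH_BYTES) % NUM_BANKS


def conflict_degree(byte_addrs):
    """Largest number of distinct words requested from any single bank.

    1 means conflict-free; N means the access would serialize N ways
    in this simplified model.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addrs:
        if not 0 <= addr < SHARED_BYTES:
            raise ValueError("address outside the 16KB shared memory")
        words_per_bank[bank_of(addr)].add(addr // BANK_WIDTH_BYTES)
    return max((len(words) for words in words_per_bank.values()), default=0)
```

For 16 threads reading consecutive words, `conflict_degree([4 * t for t in range(16)])` is 1; with a stride of 16 words, `conflict_degree([64 * t for t in range(16)])` is 16, since every access lands in bank 0.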
One new feature in Compute 1.2 devices (GT200 and newer) that facilitates high speed communication is shared memory atomic instructions. Previously, atomic instructions were implemented using a read-modify-write to global memory, which has fairly high latency (hundreds of cycles) and low performance. On the GT200 and future GPUs, atomic instructions (including CAS) can be issued against 32-bit operands in shared memory. Atomic operations against a single register are fairly straightforward to implement, and future GPUs might extend atomic support to 64-bit operations across two registers in the shared memory.
The majority of loads from memory write to the thread private register files, rather than the shared memory. To simplify the design of the SM, the architects decided not to directly load data from the global memory into the shared memory. Instead, to move values into the shared memory, data must be loaded into the registers and then separately moved into shared memory.
CUDA programs also use the __syncthreads() function to synchronize threads of a thread block at inter-thread communication points. The SM implements this with a single barrier synchronization instruction that synchronizes up to 512 concurrent threads in the thread block with low latency.
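The role of the barrier can be illustrated with an analogy in ordinary Python threads. This is only a sketch of the semantics, not of the hardware: `threading.Barrier` stands in for `__syncthreads()`, a plain list stands in for shared memory, and the function name `run_block` and the block size of 8 are hypothetical.

```python
import threading

BLOCK_SIZE = 8  # hypothetical; the text allows up to 512 threads per block


def run_block(block_size=BLOCK_SIZE):
    shared = [0] * block_size   # stands in for the block's shared memory
    result = [0] * block_size
    barrier = threading.Barrier(block_size)  # analogue of __syncthreads()

    def thread_body(tid):
        shared[tid] = tid * tid  # each thread publishes one value
        barrier.wait()           # no thread reads until every write is done
        result[tid] = shared[(tid + 1) % block_size]  # read a neighbor's value

    threads = [threading.Thread(target=thread_body, args=(t,))
               for t in range(block_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

Without the barrier, a thread could read its neighbor's slot before the neighbor had written it; with the barrier, every read observes the published value.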