Overview of a GT200
As a computation device, the GT200 is a multi-core chip organized in a two-level hierarchy that focuses primarily on achieving high compute throughput on data-parallel workloads, at the expense of single-thread performance and execution latency. This is in contrast to general-purpose processors, which focus primarily on single-thread performance and on execution and communication latency, with a secondary focus on compute throughput.
A single GT200 is composed of 10 Thread Processing Clusters (TPCs), which form the first level in the hierarchy. At the lower level, each TPC is further made up of 3 Streaming Multiprocessors (SMs), also known as Thread Processor Arrays (TPAs), plus a texture pipeline, which serves as the memory pipeline for each group of 3 SMs. Each SM loosely corresponds to a core in a modern microprocessor with 8-wide SIMD: each SM has its own front-end, complete with instruction fetch, decode and issue logic, as well as its own execution units, although the memory pipeline is shared across each TPC. Figure 4 below shows a high-level architectural overview of the GT200 and its predecessor, the G80. The major difference called out in this diagram is that the G80 has 8 TPCs and 2 SMs per TPC, for a total of 16 SMs per chip, versus 30 SMs spread across 10 TPCs for the GT200.
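The hierarchy arithmetic can be checked with a short sketch; the TPC and SM counts come straight from the text, and the 8 SIMD lanes per SM are NVIDIA's per-SM "Streaming Processor" count:

```python
# Hierarchy totals for G80 vs GT200, using the counts given in the text.
chips = {
    "G80":   {"tpcs": 8,  "sms_per_tpc": 2},
    "GT200": {"tpcs": 10, "sms_per_tpc": 3},
}

SPS_PER_SM = 8  # 8-wide SIMD: 8 "Streaming Processors" per SM

for name, c in chips.items():
    sms = c["tpcs"] * c["sms_per_tpc"]   # SMs per chip
    sps = sms * SPS_PER_SM               # SIMD lanes ("cores") per chip
    print(f"{name}: {sms} SMs, {sps} SPs")
```

This reproduces the totals above: 16 SMs (128 SPs) for the G80 and 30 SMs (240 SPs) for the GT200, and shows how the marketing "core" counts reach the hundreds while the SM count stays in the tens.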
Figure 4 – GT200 and G80 Architectures
At this point, things can get a bit confusing. The CPU and GPU worlds use somewhat different terminology, especially once NVIDIA or ATI’s marketing departments have been thrown into the mix. NVIDIA and ATI like to call their execution units (ALUs/FPUs) ‘cores’, so that they can claim to have hundreds of cores (comparisons are further complicated by the fact that NVIDIA’s ‘cores’ run at twice the frequency of ATI’s). In reality, GPUs have more like tens of cores, but are able to provide more compute power by using vectors, which lower the amount of control overhead per computation. NVIDIA’s diagrams of the G80 and GT200 go a bit beyond Figure 4 and subdivide each SM into 8 Thread Processors, Streaming Processors (SPs) or Shader Cores (depending on who you are talking to). However, it is clear that Streaming Processors in NVIDIA’s terminology are not truly independent processor cores. Each SP has a register file (at least a portion of one) and an independent instruction pointer, but the SPs lack a complete front-end that can fetch and schedule instructions independently. In that regard, the SPs most closely correspond to an issue pipeline in a modern multi-threaded CPU.
The GPU work distribution unit (or global block scheduler) manages coarse-grained parallelism at the thread block level across the whole chip. When a CUDA kernel is started, information for a grid is sent from the host CPU to the GPU. The work distribution unit reads this information and issues the constituent thread blocks, in round-robin fashion, to SMs that have sufficient resources to execute them. Some of the factors that are accounted for are the kernel’s demand for threads per block, shared memory per block, registers per thread, thread and block state requirements, and the current availability of those resources in each SM. The end goal of the work distributor is to distribute threads uniformly across the SMs to maximize parallel execution opportunities.
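As a rough sketch of the bookkeeping the work distribution unit performs, the snippet below estimates how many blocks of a hypothetical kernel could be resident on one SM at a time. The per-SM limits used here (16,384 registers, 16 KB of shared memory, 1,024 resident threads, 8 resident blocks) are the GT200 figures from NVIDIA’s CUDA documentation; the calculation itself is a simplification for illustration, not NVIDIA’s actual hardware algorithm:

```python
def blocks_per_sm(threads_per_block, smem_per_block, regs_per_thread,
                  sm_regs=16384, sm_smem=16384, sm_threads=1024, sm_blocks=8):
    """Estimate how many thread blocks fit on one GT200 SM at once.

    The count is bounded by whichever per-SM resource runs out first:
    registers, shared memory, resident threads, or the hard limit on
    resident blocks.
    """
    by_regs    = sm_regs // (regs_per_thread * threads_per_block)
    by_smem    = sm_smem // max(smem_per_block, 1)  # avoid divide-by-zero
    by_threads = sm_threads // threads_per_block
    return min(by_regs, by_smem, by_threads, sm_blocks)

# A hypothetical kernel with 256-thread blocks, 16 registers per thread,
# and 4 KB of shared memory per block: every resource limit allows
# exactly 4 resident blocks.
print(blocks_per_sm(256, 4096, 16))  # -> 4
```

A scheduler distributing blocks round-robin would keep handing blocks to any SM whose estimate here is above its current resident count, which is why a kernel’s register and shared memory demands directly limit how much parallelism the chip can expose.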