CUDA Execution Model
NVIDIA’s parallel programming model and API is known as the Compute Unified Device Architecture, or CUDA. The philosophical and architectural underpinning of CUDA is to create a virtually limitless sea of thread-level parallelism that can be dynamically exploited by the underlying hardware. The CUDA programming model treats the GPU as a highly threaded coprocessor to the host CPU, with its own associated memory model. The GPU is only useful for extremely data parallel workloads, where similar calculations are run on vast quantities of data arrayed in a regular, grid-like fashion. Classic examples of data parallel problems include image processing, simulation of physical models such as computational fluid dynamics, engineering and financial modeling and analysis, searching, and sorting. However, workloads that require more complex data structures, such as trees, associative arrays, linked lists and spatial subdivision structures, will fare poorly on current GPUs, which primarily work with array data structures. Once a data parallel workload has been identified, portions of the application can be retargeted to run on the GPU. The sections of the application that run on the GPU are known as kernels. Kernels are not full applications – rather, they are the data parallel essence of each major step of an application.
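To make the idea of a kernel concrete, consider a minimal sketch (the function and variable names here are illustrative, not taken from any particular application): an element-wise vector addition, where every thread performs the same calculation on a different piece of the data.

```cuda
// Hypothetical element-wise vector addition kernel: each thread
// computes one element of c = a + b. The __global__ qualifier marks
// a function that runs on the GPU and is launched from host code.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Each thread derives a unique global index from its block and
    // thread coordinates, so the grid of threads covers the array.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}
```

Note that the kernel contains no loop over the array – the parallel hardware supplies one thread per element, which is exactly the “sea of thread-level parallelism” described above.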
Figure 1 – The CPU and GPU – Serial code, Kernels, Grids and Blocks
Along with the data parallel kernels, there is standard serial code. This serial code lies between kernels (otherwise the two kernels would simply be combined into one), as shown in Figure 1 above. In theory, the serial code could do no more than clean up after one kernel and set up for the next; in practice, however, today’s GPUs are relatively limited, so substantial work remains in the serial portions of an application.
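The alternation between serial code and kernels can be sketched from the host’s point of view as follows (a hedged illustration: `stepOne`, `stepTwo` and `prepareNextStep` are assumed names, not real APIs):

```cuda
// Hypothetical host-side flow: serial CPU code interleaved with two
// data parallel kernel launches, mirroring Figure 1.
int n = 1 << 20;
int threadsPerBlock = 256;
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

stepOne<<<blocks, threadsPerBlock>>>(data, n);  // kernel: grid of blocks
cudaDeviceSynchronize();                        // serial code: wait for it,
prepareNextStep(data, n);                       // then set up the next step
stepTwo<<<blocks, threadsPerBlock>>>(data, n);  // next kernel
```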
A kernel is executed by a grid of thread blocks. A thread block is a group of up to 512 threads that start at the same instruction address, execute in parallel, and can communicate through shared memory and synchronization barriers. Although every thread in a block starts at the same instruction address, in theory each could diverge onto a separate code path; however, for performance reasons, divergence within a block is limited. Kernels are executed by many thread blocks. A grid is simply a collection of thread blocks that can (but are not required to) all execute in parallel. This is one of the beautiful features of CUDA – since the blocks are unordered, they execute equally well on a GPU that can handle one block at a time and on one that executes a dozen or a hundred at a time, a degree of scalability that is otherwise hard to achieve. Currently, a kernel corresponds to a single grid, but it is very likely that NVIDIA will relax this constraint in the future. Since in theory every block in a grid can execute in parallel, there is no ordering between blocks – grid boundaries serialize computation, although data can be fetched for one kernel/grid while another kernel/grid is computing.
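The cooperation within a block – shared memory plus a barrier – can be sketched as follows (an illustrative kernel, with assumed names; each block reverses its own tile of an array, which forces threads to read values that other threads in the same block wrote):

```cuda
// Hypothetical kernel: threads within one block cooperate through
// on-chip shared memory and a synchronization barrier.
#define TILE 256

__global__ void reverseTile(float *data)
{
    __shared__ float tile[TILE];          // visible to all threads in the block

    int t = threadIdx.x;
    int base = blockIdx.x * TILE;

    tile[t] = data[base + t];             // each thread loads one element
    __syncthreads();                      // barrier: wait for the whole block

    data[base + t] = tile[TILE - 1 - t];  // read a value another thread wrote
}
```

Note that `__syncthreads()` only synchronizes threads within one block – there is no equivalent barrier across blocks, which is precisely why blocks must be independent and unordered.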
Threads composing a thread block share data, so they must be issued to the same processor (which NVIDIA calls a Streaming Multiprocessor or SM). Within each thread block, a thread is issued to a single execution unit (which NVIDIA calls a Streaming Processor or SP core). This begins to touch on the internal structure of NVIDIA’s GPUs, so further details must be deferred a bit.