CUDA Memory Model
In addition to an execution model, CUDA also specifies a memory model (shown below in Figure 2) and a variety of address spaces for communication within the GPU and with the host CPU. The memories colored in red are held in fast, on-chip storage, while the orange address spaces are kept in DRAM. The lowest level is the register file. Just as in the CPU world, registers are private to each thread. Should the register file be exhausted, data spills into local memory, which is also private to each thread but is held in the frame buffer DRAM rather than in an on-chip register file or cache. Examples of data that might be stored in registers or local memory are the input operands and intermediate results of a thread.
Figure 2 – CUDA Memory Model, courtesy NVIDIA
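As a rough illustration of the two per-thread spaces described above, the following CUDA sketch uses ordinary scalars, which a compiler would normally keep in registers, alongside a small per-thread array indexed with a run-time value, which is a typical candidate for local memory. The kernel scaleAndPick and the host helper runScaleAndPick are hypothetical names invented for this example, and where each variable actually ends up is decided by the compiler, not by the source code.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: out[i] = in[i] * ((pick[i] & 15) + 1), computed via a
// per-thread lookup table so that a private array is needed.
__global__ void scaleAndPick(const float *in, const int *pick, float *out, int n)
{
    // Scalars such as i and v are private to the thread and would normally
    // be held in registers.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = in[i];

    // A per-thread array indexed with a value unknown at compile time is a
    // typical candidate for local memory (the compiler makes the final call).
    float table[16];
    for (int k = 0; k < 16; ++k)
        table[k] = v * (k + 1);

    out[i] = table[pick[i] & 15];
}

// Hypothetical host helper; CUDA error checking is omitted for brevity.
std::vector<float> runScaleAndPick(const std::vector<float> &in,
                                   const std::vector<int> &pick)
{
    assert(in.size() == pick.size());
    int n = static_cast<int>(in.size());
    std::vector<float> out(n);
    if (n == 0) return out;

    float *dIn = nullptr, *dOut = nullptr;
    int *dPick = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPick, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dPick, pick.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    scaleAndPick<<<blocks, threads>>>(dIn, dPick, dOut, n);

    // cudaMemcpy waits for the kernel to finish before copying the results.
    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPick);
    cudaFree(dOut);
    return out;
}
```

Calling runScaleAndPick with inputs {1, 2, 3} and picks {0, 1, 15} should yield {1, 4, 48}. Compiling with nvcc -Xptxas -v prints the register count and any local memory or spill usage per kernel, which is one way to check where the compiler actually placed the data.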
Next is the shared memory, which is used for communication between threads. The shared memory is a scratch-pad or local store memory that can be accessed by all the threads of a thread block (recall that there are up to 512 threads in a block). Shared memory is generally the lowest-latency communication method between threads and has approximately the same latency as the registers. It is used for a variety of purposes, such as holding shared counters (e.g. to calculate loop iterations) or shared results from thread blocks (e.g. calculating an average of 512 numbers to be used in later computations).
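The averaging example mentioned above can be sketched in CUDA as a tree reduction in shared memory. This is a minimal illustration rather than a tuned implementation: the names blockAverage, averageOf512 and kBlockSize are hypothetical, and the block size of 512 is chosen only to match the per-block thread limit cited in the text.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlockSize = 512;  // assumed block size, matching the limit cited above

// Each block averages kBlockSize consecutive inputs and writes one result.
__global__ void blockAverage(const float *in, float *avg)
{
    // One slot per thread, visible to every thread in this block only.
    __shared__ float partial[kBlockSize];

    int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * kBlockSize + tid];
    __syncthreads();  // all loads must land before any thread reads a neighbour

    // Tree reduction: halve the number of active threads at every step.
    for (int stride = kBlockSize / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        avg[blockIdx.x] = partial[0] / kBlockSize;
}

// Hypothetical host helper for a single block; error checking omitted for brevity.
float averageOf512(const std::vector<float> &values)
{
    assert(values.size() == static_cast<std::size_t>(kBlockSize));

    float *dIn = nullptr, *dAvg = nullptr;
    cudaMalloc(&dIn, kBlockSize * sizeof(float));
    cudaMalloc(&dAvg, sizeof(float));
    cudaMemcpy(dIn, values.data(), kBlockSize * sizeof(float),
               cudaMemcpyHostToDevice);

    blockAverage<<<1, kBlockSize>>>(dIn, dAvg);

    float result = 0.0f;
    cudaMemcpy(&result, dAvg, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dAvg);
    return result;
}
```

The __syncthreads() barriers are what make the shared array safe to use for communication: without them, a thread could read a neighbour's slot before that neighbour had written it.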
There are two read-only address spaces that are vestiges of the graphical nature of the GPU: the constant and texture memories. Constant memory is a relatively small space (64 KB) used for random accesses (such as instructions), while the texture memory is vastly larger and has two-dimensional locality (traditional caches have locality in a single dimension).
Both of these memories reside in the frame buffer DRAM, but since they are read-only, they are readily ‘cached’ on-chip. The constant and texture ‘caches’ don’t enforce coherency; they rely on the read-only nature of their underlying address spaces. Thus, if the CPU or GPU writes to the constant or texture memories, the caches are invalidated before the new data is used. CUDA applications use read-only constants for frequently accessed parameters, and fetch interpolated and filtered textures for high-bandwidth streaming access to large 2D and 3D images and sampled data volumes.
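To make the use of constants for frequently accessed parameters concrete, here is a small CUDA sketch in which the host writes four polynomial coefficients into constant memory and every thread then reads them. The symbol kCoeffs, the kernel applyCubic and the helper evalCubic are hypothetical names chosen for this example; texture fetches are not shown.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical parameter table placed in the constant address space.
// Kernels can only read it; the host fills it with cudaMemcpyToSymbol.
__constant__ float kCoeffs[4];

// y[i] = c0 + c1*x + c2*x^2 + c3*x^3, evaluated with Horner's rule.
__global__ void applyCubic(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float xi = x[i];
        y[i] = ((kCoeffs[3] * xi + kCoeffs[2]) * xi + kCoeffs[1]) * xi + kCoeffs[0];
    }
}

// Hypothetical host helper; error checking omitted for brevity.
std::vector<float> evalCubic(const std::vector<float> &coeffs,
                             const std::vector<float> &x)
{
    assert(coeffs.size() == 4);
    int n = static_cast<int>(x.size());
    std::vector<float> y(n);
    if (n == 0) return y;

    // Host-side write into the constant space before the kernel is launched.
    cudaMemcpyToSymbol(kCoeffs, coeffs.data(), 4 * sizeof(float));

    float *dX = nullptr, *dY = nullptr;
    cudaMalloc(&dX, n * sizeof(float));
    cudaMalloc(&dY, n * sizeof(float));
    cudaMemcpy(dX, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    applyCubic<<<blocks, threads>>>(dX, dY, n);

    cudaMemcpy(y.data(), dY, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX);
    cudaFree(dY);
    return y;
}
```

With coefficients {1, 2, 3, 4}, evaluating at 0, 1 and 2 should give 1, 10 and 49. Every thread reads the same four values, which is the kind of frequently accessed parameter the text has in mind.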
Lastly, there is the global memory, which is the normal sort of memory: it is visible to an entire grid and can be read and written arbitrarily by the GPU or the CPU. Since the global memory can be written to, it is not cached anywhere on chip.
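The grid-wide visibility of global memory can be illustrated with a short CUDA sketch in which the CPU writes an input buffer and zeroes a counter, threads from every block of the grid update that one counter, and the CPU reads the result back. The names countPositives and countPositivesOnGpu are hypothetical, and atomicAdd is used here as one simple way to keep concurrent updates from different blocks correct.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Every thread in the grid, whatever its block, updates the same counter
// in global memory.
__global__ void countPositives(const int *data, int n, unsigned int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0)
        atomicAdd(count, 1u);
}

// Hypothetical host helper with minimal, assert-based error checking.
unsigned int countPositivesOnGpu(const std::vector<int> &values)
{
    int n = static_cast<int>(values.size());
    if (n == 0) return 0;

    int *dData = nullptr;
    unsigned int *dCount = nullptr;
    cudaError_t err = cudaMalloc(&dData, n * sizeof(int));
    assert(err == cudaSuccess);
    err = cudaMalloc(&dCount, sizeof(unsigned int));
    assert(err == cudaSuccess);
    (void)err;  // silences the unused-variable warning when asserts are disabled

    // CPU side: write the input and zero the counter in global memory.
    cudaMemcpy(dData, values.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dCount, 0, sizeof(unsigned int));

    // GPU side: several blocks read the input and update the shared counter.
    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    countPositives<<<blocks, threads>>>(dData, n, dCount);

    // CPU side again: read the result back.
    unsigned int result = 0;
    cudaMemcpy(&result, dCount, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dCount);
    return result;
}
```

For the input {1, -2, 3, 0, 5} the helper should return 3. Unlike shared memory, the counter is reachable from every block, which is why a single location can collect results from the whole grid.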