OpenCL Memory Model
The OpenCL memory model defines how data is stored and communicated both within a device and between a device and the host CPU. There are four memory types (and address spaces) in OpenCL, which closely correspond to those in CUDA and DirectCompute, and all of them interact with the execution model.
The first region, global memory, is available to any work-item for both read and write access. Global memory may be cached in the OpenCL device for higher performance and power efficiency, or may reside strictly in DRAM. Global memory is also fully accessible by the CPU host. Constant memory is a read-only region for work-items on the OpenCL device, but the host CPU has full read and write access. Because the region is read-only on the device, any work-item can read it freely without synchronization. Conceptually, constant memory can be thought of as a portion of global memory that is read-only for the OpenCL device.
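The address-space qualifiers below are standard OpenCL C; the kernel itself is a hypothetical illustration, showing read/write __global buffers alongside a coefficient table that the host writes but the device only reads via __constant:

```c
// Hypothetical kernel: scales each input element by a coefficient
// held in constant memory and writes the result back to global memory.
__kernel void scale(__global const float *in,   // global: host-visible, device read/write region
                    __global float *out,
                    __constant float *coeffs,   // constant: read-only on the device, written by the host
                    const int ncoeffs)
{
    int gid = get_global_id(0);
    out[gid] = in[gid] * coeffs[gid % ncoeffs];
}
```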
The remaining memory regions are usable only by the OpenCL device and are inaccessible to the host. The first is private memory, which is readable and writable by a single work-item and corresponds roughly to the architectural register file in a classic instruction set. The vast majority of computation is done in private memory, so in many ways it is the most performance-critical region. The second region is known as local memory and is readable and writable by a single work-group. Local memory is intended for shared variables and communication between work-items; in essence, it is an architectural register file shared by a limited number of work-items. Local memory can be held in DRAM and cached, which is how most CPUs will implement it, while GPUs tend to favor dedicated, explicitly addressed hardware structures.
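A work-group reduction is a common pattern that exercises both regions. The kernel below is a sketch (the names and arguments are illustrative, and a power-of-two work-group size is assumed): each work-item holds its value in private memory, then the group combines partial sums through __local scratch space:

```c
// Sketch: per-work-group partial sum using private and local memory.
// Assumes the local work size is a power of two.
__kernel void partial_sum(__global const float *in,
                          __global float *group_sums,
                          __local float *scratch)    // one float per work-item
{
    int lid = get_local_id(0);
    float acc = in[get_global_id(0)];   // private memory: per-work-item, register-like
    scratch[lid] = acc;                 // local memory: shared across the work-group
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction within the work-group.
    for (int stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        group_sums[get_group_id(0)] = scratch[0];
}
```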
The memory consistency model for OpenCL is fairly relaxed, with a small set of primitives to impose ordering where needed. OpenCL defines four work-group synchronization primitives: a barrier and three types of fences (a read fence, a write fence, and a general memory fence). The barrier synchronizes an entire work-group, so its scope is limited by definition. Consistency grows progressively weaker as the scope widens, which makes sense: a strongly ordered model is easier to implement with few caching and memory agents, and increasingly difficult to scale as more agents are added.
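The four primitives map directly onto OpenCL 1.x built-ins. The kernel below is a contrived demonstration rather than a useful computation; it simply shows each primitive in place:

```c
// Demonstration of the four work-group synchronization primitives.
__kernel void sync_demo(__global int *out, __local int *buf)
{
    int lid = get_local_id(0);
    buf[lid] = lid;

    // 1. Barrier: every work-item in the group waits here, and local
    //    memory becomes consistent across the whole work-group.
    barrier(CLK_LOCAL_MEM_FENCE);
    out[lid] = buf[(lid + 1) % get_local_size(0)];  // safe only after the barrier

    // 2-4. Fences order this work-item's own memory operations
    //      without making other work-items wait.
    write_mem_fence(CLK_GLOBAL_MEM_FENCE);                   // write fence
    read_mem_fence(CLK_GLOBAL_MEM_FENCE);                    // read fence
    mem_fence(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);   // general memory fence
}
```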
At the smallest scope, each work-item has fairly strong consistency: ordering is preserved between an aliased load and store, but non-aliased memory instructions can be freely re-ordered. Local memory is a bit weaker; it is only consistent across a work-group at a barrier, and without a barrier there are no ordering guarantees between different work-items. Global memory is weaker still: a barrier guarantees consistency of global memory within a work-group, but there are absolutely no guarantees between different work-groups in a kernel.

Global atomic operations were an optional part of OpenCL 1.0 and are required in 1.1; they are used to guarantee consistency between any work-items in a kernel, specifically between different work-groups. Atomic operations are primarily defined for 32-bit integers, with an optional extension for 64-bit integers. An atomic operation acquires exclusive access to a memory address (to ensure ordering) and performs a read-modify-write that returns the old value. Both OpenCL and CUDA return the old value, while this is strictly optional for DirectCompute. However, the performance cost of atomic operations is fairly high on some hardware, so they should be used sparingly for the sake of scalability and performance. Since constant memory is read-only, it needs no consistency or ordering model.
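A histogram is the textbook use of global atomics, since work-items from many different work-groups may increment the same bin. The kernel below is a hypothetical sketch using the 32-bit integer atomic_add that became core in OpenCL 1.1:

```c
// Sketch: byte-value histogram. atomic_add gives each increment
// exclusive access to the bin across all work-groups and returns
// the old value (unused here).
__kernel void histogram(__global const uchar *data,
                        __global int *bins)   // assumed zero-initialized by the host
{
    int gid = get_global_id(0);
    atomic_add(&bins[data[gid]], 1);   // read-modify-write on global memory
}
```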
OpenCL uses a combination of pointers and buffers to move data within an application. Pointers are valid within a kernel, but they are not guaranteed to persist once the kernel completes. Passing data between kernels (or between the host and device) therefore uses buffer objects. This is another area where OpenCL diverges from CUDA: the latter persists pointers across kernels and does not use buffers.
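On the host side, the buffer-centric flow looks roughly like the sketch below. It assumes a context, command queue, and built kernel already exist, and the kernel signature (one input buffer, one output buffer) is hypothetical; error handling is abbreviated:

```c
#include <CL/cl.h>

/* Sketch: data crosses the host/device boundary through cl_mem
   buffer objects, not raw pointers. */
void run_once(cl_context ctx, cl_command_queue q, cl_kernel k,
              const float *host_in, float *host_out, size_t n)
{
    cl_int err;
    cl_mem in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), (void *)host_in, &err);
    cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                n * sizeof(float), NULL, &err);

    clSetKernelArg(k, 0, sizeof(cl_mem), &in);
    clSetKernelArg(k, 1, sizeof(cl_mem), &out);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* Blocking read copies the result back to the host. The buffer,
       not a device pointer, is what persists between kernels. */
    clEnqueueReadBuffer(q, out, CL_TRUE, 0, n * sizeof(float),
                        host_out, 0, NULL, NULL);

    clReleaseMemObject(in);
    clReleaseMemObject(out);
}
```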