OpenCL Execution Model
General purpose computing on GPUs has been a topic of interest for a considerable time. The early work was done in academia, primarily in the Stanford graphics group, and focused on mapping general workloads onto the limited shader languages of the day (e.g. through Brook, a streaming language that compiled down to graphics APIs). Many of the Stanford graphics graduate students went into industry and influenced the evolution of GPUs into programmable hardware. The first commercial API was CUDA, which has in turn influenced later APIs such as OpenCL and DirectCompute. All three APIs use variants of C that add and remove certain features. None of the languages is a superset of C, so not all C programs will map cleanly to the respective languages. Given the shared ancestry and shared starting language, it should not be surprising that there are many similarities between the three.
OpenCL, DirectCompute and CUDA are APIs designed for heterogeneous computing – with both a host (i.e. a CPU) and a compute device. The device can be the same hardware as the host – for instance, a CPU can serve as both – but the device is often different hardware (e.g. a GPU or DSP).
OpenCL applications have serial portions, which execute on the host CPU, and parallel portions, known as kernels. The parallel kernels may execute on any OpenCL-compatible device (CPU or GPU), and synchronization is enforced between kernels and the serial code. OpenCL is explicitly intended to handle both task and data parallel workloads, while CUDA and DirectCompute are primarily focused on data parallelism.
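In practice, the serial portion is ordinary host code that calls the OpenCL runtime API to select a device, build the kernel and hand work off to it. A minimal sketch of that flow is shown below; the kernel name "my_kernel" is hypothetical, and error handling and the clSetKernelArg calls for the kernel's buffer arguments are omitted for brevity.

#include <CL/cl.h>

/* Minimal host-side flow: pick a device, build a kernel from source,
   launch it over n work-items, and wait for it to finish before the
   serial code continues. Error checking is omitted. */
int run_kernel(const char *src, size_t n)
{
    cl_platform_id plat;  cl_device_id dev;  cl_int err;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, dev, 0, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "my_kernel", &err);  /* hypothetical kernel name */

    size_t global = n;   /* one work-item per data element */
    clEnqueueNDRangeKernel(queue, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clFinish(queue);     /* synchronize: the host blocks until the kernel completes */

    clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}

The clFinish call is one of the synchronization points between the parallel kernel and the surrounding serial code.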
A kernel applies a single stream of instructions to vast quantities of data that are organized as a 1-3 dimensional array (called an N-D range). Each piece of data is known as a work-item in OpenCL terminology, and kernels may have hundreds or thousands of work-items. At a high level, this sounds a lot like SIMD execution, where each work-item is a SIMD lane. However, one of the key goals of OpenCL is to provide a flexible form of data parallelism that is not explicitly tied to specific vector lengths and can be mapped to all sorts of different hardware, so in some sense an OpenCL kernel is a generalization of SIMD. The kernel itself is organized into many work-groups that are relatively limited in size; for example, a kernel could have 32K work-items, split into 64 work-groups of 512 items each. Unlike a traditional computation, where any part of the program may communicate with any other, communication within a kernel is strongly limited; however, communication and synchronization are generally allowed locally, within a work-group. So work-groups serve two purposes: first, they break up a kernel into manageable chunks, and second, they define a limited scope for communication.
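As a concrete illustration, the OpenCL C sketch below shows what such a kernel looks like; the kernel name and its arguments are hypothetical. Each work-item handles one element of a one-dimensional N-D range, and the work-group cooperatively reduces its values through local memory, demonstrating the limited, work-group-scoped communication described above (the reduction assumes a power-of-two work-group size).

// Hypothetical kernel: each work-item scales one element, and the
// work-group cooperatively accumulates a partial sum in local memory.
__kernel void scale_and_sum(__global const float *in,
                            __global float *out,
                            __global float *group_sums,
                            __local float *scratch,
                            float scale)
{
    size_t gid = get_global_id(0);   // index into the 1-D N-D range
    size_t lid = get_local_id(0);    // index within this work-group
    size_t lsz = get_local_size(0);  // work-group size (e.g. 512)

    float v = in[gid] * scale;
    out[gid] = v;                    // purely data-parallel part

    // Work-group-local communication: reduce this group's values.
    scratch[lid] = v;
    barrier(CLK_LOCAL_MEM_FENCE);    // synchronizes only within the work-group
    for (size_t s = lsz / 2; s > 0; s /= 2) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        group_sums[get_group_id(0)] = scratch[0];
}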
Kernels form the basis of OpenCL, but they can also be composed into a task graph via asynchronous command queues. The programmer indicates the dependencies between kernels and what conditions must be met before a kernel can start executing. The OpenCL run-time layer can simultaneously execute independent kernels, thus extracting task parallelism within an application. While the initial uses of OpenCL will probably focus on data parallelism, the best performance will be achieved by combining task and data parallel techniques.
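The host-side sketch below (with hypothetical kernel handles kernel_a and kernel_b, and error handling omitted) shows how events express such dependencies: kernel_b may not start until kernel_a has completed, while an out-of-order queue remains free to run unrelated kernels concurrently.

/* Assumes a context ctx, a device dev and the kernels kernel_a and
   kernel_b have already been created (hypothetical names). */
cl_int err;
cl_command_queue queue = clCreateCommandQueue(
    ctx, dev, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

size_t global[1] = { 32768 };   /* 32K work-items */
size_t local[1]  = { 512 };     /* 64 work-groups of 512 */

cl_event a_done;
clEnqueueNDRangeKernel(queue, kernel_a, 1, NULL, global, local,
                       0, NULL, &a_done);          /* no dependencies */

/* kernel_b waits on kernel_a's completion event before it may start. */
clEnqueueNDRangeKernel(queue, kernel_b, 1, NULL, global, local,
                       1, &a_done, NULL);

clFinish(queue);                /* wait for the whole task graph */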
OpenCL defines a broad universe of data types for computation in each work-item. On the integer side, the data types include boolean, character (8-bit), short (16-bit), int (32-bit) and long (64-bit), with a 128-bit long long reserved for the future. Most of these integer types are available in both signed and unsigned variants.
For floating point, OpenCL both defines a variety of data types and specifies minimum precision for most operations. The floating point data types are relatively standard – single precision is required and double precision is optional. In addition, there is a half precision (16-bit) floating point type for data storage; computation is still done at single precision, but for less precise data the storage requirements can be cut in half. Thankfully, OpenCL also enforces a minimum level of floating point precision and accuracy, generally consistent with IEEE 754. Double precision has the most stringent requirements, including a fused multiply-add instruction, all four rounding modes (round to nearest even, toward zero, toward +infinity and toward -infinity), and proper handling of denormal numbers, infinities and NaNs. Single precision is somewhat more lax and only requires round to nearest even and handling of infinities and NaNs. In both cases, all operations have a guaranteed minimum precision – this is especially critical for math functions that are implemented in libraries, such as the transcendental functions. Half precision requires an IEEE compatible storage format and correct conversions.
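The OpenCL C sketch below (hypothetical kernel name and arguments) illustrates the storage-only role of half precision: the built-in vload_half and vstore_half functions convert between a 16-bit half buffer in memory and the 32-bit float values used for arithmetic.

// Hypothetical kernel: data is stored as 16-bit half values, but all
// arithmetic is carried out in 32-bit single precision.
__kernel void scale_half_buffer(__global const half *in,
                                __global half *out,
                                float scale)
{
    size_t i = get_global_id(0);
    float x = vload_half(i, in);    // half -> float conversion on load
    float y = x * scale;            // computation in single precision
    vstore_half(y, i, out);         // float -> half conversion on store
}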
OpenCL also provides a number of more sophisticated data types on top of these basic ones. Most data types (except half precision and boolean) are also part of the specification in vector form, with lengths of 2, 4, 8 and 16. Vector operations are component-wise, so that each lane is independent. This is a clear contrast to DirectCompute and CUDA, which only support vectors of length 2-4. OpenCL has pointers to most data types, which helps make developers coming from C comfortable, but it does come with a cost, because pointers create potential aliasing problems (just as in C) that can hinder compiler optimization. Vectorization is critical for performance on many CPUs and GPUs (although not Nvidia GPUs), and will be much more heavily emphasized in OpenCL than in CUDA.
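For example, the short fragment below (hypothetical kernel and argument names) uses the built-in float4 vector type; the multiply and add are applied component-wise, so the four lanes remain fully independent.

// Hypothetical kernel using component-wise vector arithmetic.
__kernel void saxpy4(__global const float4 *x,
                     __global float4 *y,
                     float a)
{
    size_t i = get_global_id(0);
    // The scalar a is broadcast across all four components; the
    // multiply and add operate on each component independently.
    y[i] = a * x[i] + y[i];
}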
There are also data types for 2 and 3-dimensional images, along with samplers for texture-style sampling and filtering of images. The standard has reserved a number of other data types, such as complex numbers (using floating point formats for the real and imaginary parts), matrices and high precision formats (128-bit integers and floating point). These are not yet part of OpenCL, but the reservations make it clear that they are all candidates for future inclusion.
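A kernel that reads a 2-D image through a sampler might look like the sketch below (kernel name and sampler settings are illustrative); read_imagef performs the sampling and any filtering, typically using the GPU's texture hardware.

// Hypothetical kernel: sample a 2-D image with normalized coordinates
// and linear filtering, writing the result to an output image.
__constant sampler_t smp = CLK_NORMALIZED_COORDS_TRUE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_LINEAR;

__kernel void copy_filtered(__read_only image2d_t src,
                            __write_only image2d_t dst)
{
    int2 pos = (int2)((int)get_global_id(0), (int)get_global_id(1));
    int2 dim = get_image_dim(dst);
    // Convert the integer pixel position into normalized [0,1) coordinates.
    float2 coord = (float2)((pos.x + 0.5f) / dim.x,
                            (pos.y + 0.5f) / dim.y);
    float4 texel = read_imagef(src, smp, coord);   // filtered read
    write_imagef(dst, pos, texel);
}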