Terminology and Summary
One of the more confusing aspects of GPU computing is the terminology. The lexicon for CPUs and computer architecture is relatively consistent across vendors. For graphics APIs, there is a common language and understanding, formed by DirectX and OpenGL, that most hardware and software follow. For graphics hardware, however, the terminology varies considerably and is often imprecise and subject to change; fortunately, the common APIs give some semblance of order. In contrast to graphics and computer architecture, the idea of using GPUs for computation is relatively new. The industry standards are nascent, but hopefully OpenCL and DirectCompute will provide a relatively standard language for understanding the software aspects of GPU computing. While the terminology in these APIs may not be universally adopted, they will reduce confusion by providing common ground. Equally important, since OpenCL is intended to run on almost any device, its common software architecture will be very helpful for understanding the different flavors of hardware.

Table 1 – Comparison of OpenCL, DirectCompute and CUDA
The table shows the correspondence between the terminology used in OpenCL, DirectCompute and CUDA, and also compares certain features. One notable difference in the execution model is that OpenCL and DirectCompute specifically omit any microarchitectural aspects of execution and avoid horizontal (cross-lane) operations; both of these choices improve portability and performance across many different devices. CUDA is a proprietary API and portability was never a goal, so Nvidia exposes warps and certain horizontal warp functions through the API.
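To make the contrast concrete, the sketch below uses one of CUDA's warp vote intrinsics, __any(), which evaluates a predicate horizontally across every thread in a warp. The kernel and variable names here are purely illustrative; OpenCL 1.x has no portable equivalent, precisely because it does not expose the warp.

    // Minimal sketch of a CUDA warp vote (compute capability 1.2+;
    // later CUDA versions replace __any() with __any_sync()).
    __global__ void warp_any_example(const float *in, int *out, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int pred = (idx < n) && (in[idx] > 0.0f);
        // Horizontal operation: the predicate is combined across the
        // whole warp, and every thread in the warp sees the same result.
        int any_positive = __any(pred);
        if (idx < n)
            out[idx] = any_positive;
    }

Because the warp width is baked into the result, code like this is tied to Nvidia's microarchitecture; that coupling is exactly what OpenCL and DirectCompute avoid.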
The three APIs have changed the local memory capacity over time, in step with advances in hardware. Early versions, including OpenCL 1.0, Compute Shader 4.x and CUDA on compute capability 1.x hardware, specified 16KB of local memory, although in OpenCL that was only a minimum. The local memory for OpenCL 1.1 must be at least 32KB, while DirectCompute (Compute Shader 5.0) requires exactly 32KB. CUDA on compute capability 2.x takes a slightly different tack and mirrors Nvidia's Fermi hardware, allowing local memory to be configured as either 16KB or 48KB. The reason that OpenCL specifies only minimum sizes for local memory is to enable a diversity of hardware. Most CPUs will use regular system memory, held in a cache, for local memory; even L1 caches can easily exceed 32KB, and L2 and L3 caches are orders of magnitude larger. Moreover, each SPE in the Cell processor has a 256KB local store.
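On Fermi, the 16KB/48KB choice comes from splitting a 64KB on-chip array between L1 cache and shared memory (CUDA's name for local memory), selected per kernel through the runtime API. A minimal sketch, with an illustrative kernel:

    #include <cuda_runtime.h>

    __global__ void tiled_kernel(float *out)
    {
        __shared__ float tile[4096];   // 16KB of statically declared shared memory
        tile[threadIdx.x] = (float)threadIdx.x;
        __syncthreads();
        out[threadIdx.x] = tile[threadIdx.x];
    }

    void configure(void)
    {
        // cudaFuncCachePreferShared -> 48KB shared memory, 16KB L1 cache
        // cudaFuncCachePreferL1     -> 16KB shared memory, 48KB L1 cache
        cudaFuncSetCacheConfig(tiled_kernel, cudaFuncCachePreferShared);
    }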
The changing storage capacity highlights one of the pitfalls of OpenCL. While the specification does ensure functional correctness across different platforms, it does not guarantee optimal performance. Hardware can vary along a number of axes: the number of work-groups that can execute concurrently, and the latency, bandwidth and capacity of on-chip and off-chip memory (e.g. registers and caches, DRAM, etc.). Tuning for a specific platform will often produce suboptimal code for other platforms. For example, using 4-wide vectors on AMD GPUs is necessary for optimal performance, while Nvidia GPUs see only mild gains from vectorization. As a result, software optimized for Nvidia platforms is typically unvectorized and will not run efficiently on AMD GPUs. This problem is universal for almost any cross-platform environment. However, the variations in performance for OpenCL on GPUs will be much larger than for, say, Java on CPUs, because the variations in microarchitecture are also much larger. One related issue is that local memory declared inside an OpenCL kernel is statically sized (i.e. at compile time), without any knowledge of the underlying hardware. Sizing memory dynamically (i.e. at run-time, based on the actual device) helps to improve performance across different hardware. As a simple example, software written for a smaller 16KB local memory will leave performance on the table on hardware that has more capacity (say 32KB, like AMD's GPUs).
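CUDA shows what such run-time sizing can look like: a kernel declares an extern shared array whose size is supplied at launch, after querying the device's actual capacity. A minimal sketch; the kernel name and the choice to use half the reported capacity are illustrative assumptions, not a recommendation:

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void scaled_kernel(float *out)
    {
        // Size is not fixed at compile time; it is supplied at launch.
        extern __shared__ float scratch[];
        scratch[threadIdx.x] = (float)threadIdx.x;
        __syncthreads();
        out[threadIdx.x] = scratch[threadIdx.x];
    }

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // Query the device and size the buffer from its actual capacity.
        size_t shmem_bytes = prop.sharedMemPerBlock / 2;

        float *out;
        cudaMalloc(&out, 256 * sizeof(float));
        scaled_kernel<<<1, 256, shmem_bytes>>>(out);
        cudaDeviceSynchronize();
        cudaFree(out);
        printf("launched with %zu bytes of shared memory\n", shmem_bytes);
        return 0;
    }

OpenCL has a related mechanism, local-memory kernel arguments sized at run-time via clSetKernelArg, but the burden of querying the device and picking a size still falls entirely on the programmer.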
Despite its flaws, OpenCL holds great promise as an open, compatible and standards-based approach to parallel computing on GPUs and other alternative devices. At present, OpenCL is still in the very early stages, with limited hardware and software support. However, it has broad backing throughout the PC and embedded ecosystems, and is just starting down the path to maturity as a common API for software developers. Judging by history, OpenCL and DirectCompute will eventually come to dominate the landscape, just as OpenGL and DirectX became the standards for graphics.