The API for CUDA encompasses a whole ecosystem of software, as with any other platform. At its heart is CUDA C, which is compiled with NVIDIA’s compiler (nvcc, based on the Open64 back-end). To be clear, CUDA C is not C – it is a variant of C with extensions. The extensions chiefly fall into four categories:
- Indicating whether a function executes on the host CPU or the GPU
- Indicating in which of the GPU address spaces a variable resides
- Specifying the execution parallelism of a kernel function in terms of grids and blocks
- State variables which store grid and block dimensions and indices for blocks and threads
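A trivial kernel shows all four categories at once. This is a hypothetical sketch for illustration (the function and variable names are invented, not taken from the article):

```cuda
#include <cuda_runtime.h>

// Function qualifier: __global__ marks a kernel that executes on the
// GPU but is launched from the host CPU.
__global__ void scale(float *data, float factor, int n)
{
    // Variable qualifier: __shared__ places this buffer in the
    // per-block shared memory address space.
    __shared__ float tile[256];

    // Built-in state variables: blockIdx, blockDim and threadIdx
    // identify this thread within the grid of blocks.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = data[i];
        data[i] = tile[threadIdx.x] * factor;
    }
}

int main(void)
{
    const int n = 1024;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));

    // Execution configuration: <<<grid, block>>> specifies how many
    // blocks, and how many threads per block, execute the kernel.
    scale<<<n / 256, 256>>>(d_data, 2.0f, n);

    cudaFree(d_data);
    return 0;
}
```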
The API requires the CUDA driver, which is now included in all NVIDIA graphics drivers. An optional component is the CUDA run-time, a dynamic (JIT) compiler that can target the actual underlying hardware architecture (which NVIDIA does not publicly disclose). Lastly, the API also includes math libraries – cuFFT, cuBLAS and cuDPP – which can be called by users (again, these are optional, not mandatory).
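For instance, the legacy cuBLAS interface exposes standard BLAS routines that operate on matrices already resident in GPU memory. A hypothetical host-side sketch (the function name and setup are illustrative, not part of the library):

```cuda
#include <cublas.h>

// Sketch: multiply two n x n single-precision matrices that already
// reside in GPU memory, using the legacy cuBLAS interface.
void gpu_gemm(int n, const float *d_A, const float *d_B, float *d_C)
{
    cublasInit();  // in practice, initialize once per application

    // C = 1.0 * A * B + 0.0 * C, column-major storage, no transposition
    cublasSgemm('n', 'n', n, n, n, 1.0f, d_A, n, d_B, n, 0.0f, d_C, n);

    cublasShutdown();
}
```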
nvcc can target three types of output: PTX, CUDA binaries or standard C. PTX (Parallel Thread eXecution) is a virtual instruction set designed as input code for a dynamic compiler (contained in standard NVIDIA drivers). The CUDA run-time layer JIT compiles PTX code into the native operations for whatever GPU the user has installed. The beauty of this approach is that PTX provides the benefits of a stable interface – backwards compatibility, longevity, scalability and high performance – while preserving greater engineering freedom for NVIDIA’s hardware designers. This technique guarantees compatibility, but not unlimited freedom: successive generations will be designed to support at least the capabilities and capacities of prior generations, so that today’s PTX programs will run fine on future systems.
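To give a feel for the virtual ISA, a PTX fragment for a kernel that increments one word in global memory might look roughly like the following. This fragment is hand-written for illustration – real nvcc output varies by compiler version and target:

```ptx
.version 1.4
.target sm_10

.entry add_one (.param .u32 ptr)
{
    .reg .u32 %r<4>;
    ld.param.u32   %r1, [ptr];    // load the pointer parameter
    ld.global.u32  %r2, [%r1];    // load the word from global memory
    add.u32        %r3, %r2, 1;   // increment it
    st.global.u32  [%r1], %r3;    // store it back
    ret;
}
```

The driver’s JIT translates virtual registers and instructions like these into whatever the installed GPU natively executes.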
While PTX with a JIT compiler is likely to be the highest performance approach, it is not suitable for all uses. Some ISVs prefer to give up some performance in exchange for more deterministic and easy-to-validate behavior, since the JIT compiler will change its output based on the target hardware and other factors. ISVs that certify their code (for instance, many financial applications) can compile directly to CUDA binaries (.cubin files) and avoid any uncertainty in the JIT process. Directly compiled CUDA binaries are tied to both a specific graphics card and driver, but the latter is true of almost any professional application.
The last option is for nvcc to output standard C, which is typically redirected to ICC, GCC or another suitable high performance compiler. While CUDA is most useful for writing code that runs on an NVIDIA GPU, once the parallelism in an application is explicitly expressed, it can also markedly improve scalability on multi-core CPUs. Some initial results showed an improvement of 4X over standard x86 compiled code when using CUDA’s form of explicit parallelism.
Evolution of the ISA and Compute Capabilities
CUDA was designed to accommodate change in both the software specification and the underlying hardware. The capabilities of CUDA compute devices (NVIDIA GPUs) are described by a revision number, with the first digit indicating the core architecture and the second digit (after the decimal) indicating more subtle improvements.
While CUDA has existed for only a year and change, there have been three minor revisions, each trending towards more general purpose functionality. Compute 1.1 added atomic functions operating on 32-bit data words in global memory. Compute 1.2 added atomic functions on 32-bit words in shared memory, atomic functions on 64-bit words in global memory, two new warp voting functions (detailed later in this article) and support for the GT200 microarchitecture. Compute 1.3 added support for double precision floating point values. To date, the GeForce GTX 280 and 260 and the Tesla S1070 and C1060 are the only Compute 1.3 devices, and there are no Compute 1.2 devices. The lack of Compute 1.2 devices today seems to indicate that at least one future GPU will omit double precision floating point to reduce costs.
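As an example of the Compute 1.1 additions, atomicAdd performs a race-free read-modify-write on a 32-bit word in global memory. A hypothetical sketch (the kernel name and logic are invented for illustration):

```cuda
// Count how many input elements are positive. Many threads may try to
// update the counter simultaneously, so an atomic read-modify-write
// is required to avoid lost updates.
__global__ void count_positive(const float *data, int n, int *counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.0f)
        atomicAdd(counter, 1);  // Compute 1.1+: atomic on a 32-bit word in global memory
}
```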