AMD’s graphics processors are programmable, and their compute capabilities are exposed through two APIs: the industry-standard OpenCL and the Windows-specific DirectCompute. Both APIs are relatively nascent. OpenCL 1.0 was released in December 2008, and version 1.1 followed in June 2010. DirectCompute is an integral part of DX11, which launched with Windows 7 in October 2009. DX11 also allows compute shaders to execute on older DX10 hardware, albeit with limited functionality. In contrast, Nvidia’s proprietary CUDA API is much more established; it was initially released in February 2007, a head start of nearly two years. As a result, CUDA has more applications (mostly in high-performance computing) and wider adoption today. OpenCL was also hampered early on because it was essentially only available for Nvidia hardware. However, OpenCL and DirectCompute are supported by all major CPU and GPU vendors rather than just Nvidia. OpenCL in particular has the support of many vendors, both within the PC ecosystem (e.g. AMD and Intel) and in the mobile world (e.g. Imagination Technologies). While CUDA is still clearly the larger and more mature ecosystem, it is equally clear that the momentum going forward is behind OpenCL and DirectCompute.
AMD’s initial ‘CTM’ programming interface for GPUs was extremely low-level and exposed most hardware details. When OpenCL was first being discussed, it became clear that AMD would ultimately deprecate its CTM efforts and use the existing work to enable its two APIs of choice. As a result, AMD still offers programmers a high degree of visibility and control over the hardware, even though the proprietary CAL initiative is long gone and OpenCL and DirectCompute are higher-level.
AMD is primarily focused on graphics applications, which are responsible for the overwhelming majority of GPU sales. As a result of this emphasis and AMD’s architecture choices, its GPUs sit in the middle of the programmability spectrum. They are not as easily programmable as Nvidia’s GPUs, but are clearly ahead of Intel’s GPU efforts. There is an inherent trade-off between performance and programmability for any device, CPU or GPU. Within the realm of GPUs, AMD simply chose a different balancing point than other vendors.
AMD’s programming model hews very closely to OpenCL, which provides a convenient starting point for terminology. The instruction set architecture of AMD’s GPUs is fundamentally a Very Long Instruction Word (VLIW) approach, packing multiple instructions that can execute in parallel into a single VLIW bundle. The very nature of VLIW means that the microarchitecture is explicitly exposed to the compiler and the programmer. A kernel’s instruction stream may be arbitrarily long, but it is compiled into control flow instructions and a number of finite-length (and relatively short) clauses. Instructions in a kernel are not freely commingled; each clause may only contain a single type of instruction. Control flow instructions initiate clauses, write data back to a buffer or memory, and control branching, looping, subroutines and general program flow. The five principal types of clauses are texture fetch, vertex fetch, memory load, memory export and ALU. The first four types are limited to 8, 16, 16 and 1 instructions respectively, while ALU clauses can contain up to 128 VLIW bundles. Texture and vertex fetch clauses are used primarily for graphics, while the latter three types are used for general-purpose workloads. Clauses are always executed to completion and cannot be interrupted.
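To make the clause limits concrete, the following is a minimal Python sketch of how a typed instruction stream might be cut into clauses. It is a toy model rather than AMD’s compiler: the names `CLAUSE_LIMITS` and `split_into_clauses` are hypothetical, control flow instructions are ignored, and the only rules it applies are the two stated above (one instruction type per clause, and a per-type size limit).

```python
from itertools import groupby

# Per-clause size limits as quoted in the text.
# The ALU limit counts VLIW bundles rather than individual instructions.
CLAUSE_LIMITS = {
    "texture_fetch": 8,
    "vertex_fetch": 16,
    "memory_load": 16,
    "memory_export": 1,
    "alu": 128,
}


def split_into_clauses(stream):
    """Cut a sequence of instruction-type names into (type, size) clauses.

    Hypothetical helper: consecutive entries of the same type are grouped,
    and each group is split whenever it exceeds that type's limit.
    """
    clauses = []
    for kind, run in groupby(stream):
        remaining = len(list(run))
        limit = CLAUSE_LIMITS[kind]
        while remaining > 0:
            size = min(remaining, limit)
            clauses.append((kind, size))
            remaining -= size
    return clauses


# Example: 10 texture fetches, 130 ALU bundles, then 2 exports.
example_stream = ["texture_fetch"] * 10 + ["alu"] * 130 + ["memory_export"] * 2
example_clauses = split_into_clauses(example_stream)
```

In this example the ten texture fetches need two clauses (8 and 2), the 130 ALU bundles need two clauses (128 and 2), and each export becomes its own clause. A real compiler would also weigh scheduling and dependencies when choosing where to end a clause.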
The data stream of a kernel is architecturally defined as a number of work-groups. OpenCL limits work-groups to 1024 work-items or fewer. For efficiency reasons, AMD only supports work-groups with 1 to 256 work-items; 64 and 256 are the most common sizes. However, the GPU executes work-groups using wavefronts, an intermediate microarchitectural grouping that is both a data and control flow construct. For current AMD GPUs, a wavefront is defined to be 64 work-items wide (data flow) and 1 instruction or VLIW bundle deep (control flow). Thus a work-group of 64 or 256 work-items corresponds to 1 or 4 wavefronts, respectively.
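The work-group to wavefront arithmetic can be sketched in a few lines of Python. The function names here are hypothetical, and the sketch assumes that a work-group whose size is not a multiple of 64 still occupies a whole number of wavefronts, with the leftover lanes sitting idle; the text itself only covers the 64 and 256 cases.

```python
WAVEFRONT_WIDTH = 64        # work-items per wavefront on current AMD GPUs (from the text)
MAX_WORK_GROUP_SIZE = 256   # AMD's supported upper bound (from the text)


def wavefronts_per_work_group(work_items):
    """Number of wavefronts needed to cover a work-group of the given size."""
    if not 1 <= work_items <= MAX_WORK_GROUP_SIZE:
        raise ValueError("work-group size must be between 1 and 256 work-items")
    # Ceiling division: a partially filled wavefront is assumed to count as a full one.
    return -(-work_items // WAVEFRONT_WIDTH)


def idle_lanes(work_items):
    """Lanes left unused in the last wavefront, under the same assumption."""
    return wavefronts_per_work_group(work_items) * WAVEFRONT_WIDTH - work_items
```

Under these assumptions, a work-group of 100 work-items needs 2 wavefronts and leaves 28 lanes idle, which suggests why sizes of 64 and 256 are the common choices.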