The Road to Enlightenment
The CUDA programming and memory models have been described in detail in our previous analysis of the GT200. CUDA is descended from DirectX and OpenGL, rather than from the typical programming models exposed by PowerPC, x86 or ARM. As a result, CUDA and other GPU programming models carry many of the limitations of these graphics APIs – a byzantine collection of address spaces and memories, with very simple data structures such as arrays or matrices. One of the clear trends for Nvidia’s next generation of hardware is to break down these barriers and enable greater degrees of indirection and programming support.
The first set of changes affects control flow, and there are three principal improvements here. The first is indirect branching, which is common enough on CPUs to merit its own type of branch predictor, but was previously prohibited on Nvidia GPUs. An indirect branch is one where the target address is not explicitly encoded in the instruction. Instead, the instruction specifies a register holding the branch target (e.g. JMP %eax in x86), which enables a single branch to target different addresses in different situations. Indirect control flow enables virtual functions in C++ and other languages.
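To make the link between indirect branches and virtual functions concrete, here is a minimal host-side C++ sketch. It is not Nvidia's code or ISA, and the handler and class names are hypothetical, chosen only for illustration. Both the table lookup and the virtual call pick their target at run time, which is why a compiler would typically lower them to an indirect call or jump.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical handlers; the names are illustrative only.
int add_one(int x)   { return x + 1; }
int twice(int x)     { return x * 2; }
int flip_sign(int x) { return -x; }

using Handler = int (*)(int);

// The call through table[op] has no fixed target in the instruction stream.
// The address is read from memory into a register at run time, so the
// compiler typically emits an indirect call.
int dispatch(std::size_t op, int x) {
    static const Handler table[] = {add_one, twice, flip_sign};
    assert(op < sizeof(table) / sizeof(table[0]));
    return table[op](x);
}

// A virtual call uses the same mechanism: the object carries a pointer to a
// table of function addresses, and the call site jumps through that table.
struct Shape {
    virtual ~Shape() = default;
    virtual int area() const = 0;
};

struct Square : Shape {
    int side;
    explicit Square(int s) : side(s) {}
    int area() const override { return side * side; }
};

struct Rect : Shape {
    int w, h;
    Rect(int w_, int h_) : w(w_), h(h_) {}
    int area() const override { return w * h; }
};

// One call site, many possible targets, decided by the dynamic type of s.
int area_of(const Shape& s) { return s.area(); }
```

Calling dispatch(1, 4) goes through the second table entry, and area_of runs Square::area or Rect::area depending on the object it is handed. Without indirect branches, neither call site could be expressed as a single instruction sequence.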
Secondly, fine-grained exception handling has been added. Previously there was no real exception handling. Now, each individual vector lane (which Nvidia calls a thread) can trigger an exception handler, which is needed for the try and catch constructs commonly used in C++. This is far from trivial: to trigger an exception, the state for the vector lane must be safely saved somewhere, so that the exception handler can inspect it when it starts.
What is interesting is that the two features mentioned above can be cleverly combined to achieve the third major programmability change – enabling the GPU itself to make calls. The ability to save state prior to an exception can be re-used to save the caller’s state, and indirect branching can then be used to target a function which has already been loaded into the address space. Recursive calls are supported as well, although Nvidia has not yet disclosed the ABI and calling conventions. This is a pretty substantial step forward in terms of programmability, again moving away from a simplistic GPU model towards a more full-featured environment.
One of the key steps for Nvidia with their next generation hardware is to clean up their messy memory model and create a single unified address space for local (thread private), shared (for each thread block) and global memory (device and system-wide). These have all been consolidated into the existing 40-bit virtual and physical address space used by the GT200. Previously, separate load instructions were used to access each address space; these have been supplanted by a single load instruction. At compile time, the addresses for these memories are determined, and the hardware is configured to correctly route fetches to the appropriate sub-space.
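The routing step can be pictured as a simple range check on a single generic address. The C++ sketch below is only a model of that idea: the window bases and sizes are made-up values, not Nvidia's actual layout, and the only figure taken from the text is the 40-bit width of the address space.

```cpp
#include <cassert>
#include <cstdint>

enum class Space { Local, Shared, Global };

// 40-bit address space, as described in the text.
constexpr std::uint64_t kAddressBits  = 40;
constexpr std::uint64_t kAddressLimit = 1ULL << kAddressBits;

// ASSUMPTION: hypothetical windows for the local and shared sub-spaces.
// These numbers are illustrative and do not reflect real hardware.
constexpr std::uint64_t kLocalBase  = 0x01000000;  // 16 MiB window
constexpr std::uint64_t kLocalSize  = 0x01000000;
constexpr std::uint64_t kSharedBase = 0x02000000;  // 64 KiB window
constexpr std::uint64_t kSharedSize = 0x00010000;

struct Routed {
    Space space;           // which memory services the access
    std::uint64_t offset;  // offset within that memory
};

// Models what a single generic load has to decide: which sub-space the
// address falls in. Anything outside the two windows is treated as global.
Routed route(std::uint64_t addr) {
    assert(addr < kAddressLimit);
    if (addr >= kLocalBase && addr - kLocalBase < kLocalSize) {
        return {Space::Local, addr - kLocalBase};
    }
    if (addr >= kSharedBase && addr - kSharedBase < kSharedSize) {
        return {Space::Shared, addr - kSharedBase};
    }
    return {Space::Global, addr};
}
```

With separate load instructions, the choice of memory is baked into the opcode. With a scheme like this one, the same load works on any pointer, which is what makes passing pointers between functions practical.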
With a unified address space, indirection for data structures is possible. Nvidia now supports pointers and object references, which are necessary for C++ and most other high-level languages that pass by reference. On the same note, Nvidia has added a new addressing mode as required by OpenCL. Images are first-class citizens in OpenCL and require an (x,y) addressing mode to improve the handling of graphical data.
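As a rough illustration of what (x,y) addressing buys the programmer, the C++ sketch below does in software what an image addressing mode would do for each access: it turns a coordinate pair into a linear offset and handles out-of-range coordinates. The Image type, the row-major layout and the clamp-to-edge policy are assumptions made for this example, not a description of Nvidia's hardware or of the OpenCL API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical image type: a row-major array of single-channel texels.
struct Image {
    int width;
    int height;
    std::vector<float> texels;

    Image(int w, int h)
        : width(w), height(h),
          texels(static_cast<std::size_t>(w) * static_cast<std::size_t>(h), 0.0f) {}

    // Convert (x, y) to a linear index. Out-of-range coordinates are clamped
    // to the nearest edge, which is one common policy (an assumption here).
    std::size_t index(int x, int y) const {
        assert(width > 0 && height > 0);
        x = std::max(0, std::min(x, width - 1));
        y = std::max(0, std::min(y, height - 1));
        return static_cast<std::size_t>(y) * static_cast<std::size_t>(width)
             + static_cast<std::size_t>(x);
    }

    float read(int x, int y) const { return texels[index(x, y)]; }
    void write(int x, int y, float v) { texels[index(x, y)] = v; }
};
```

For a 4x3 image, the texel at (2,1) sits at linear offset 6, and a read at (-5,1) is clamped to (0,1). Doing this arithmetic and bounds handling as part of the address calculation saves instructions in kernels that spend most of their time sampling image data.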
These are the most notable global changes to Nvidia’s ISA, but there are other more subtle changes that will be covered later in the appropriate sub-section.