Over the last 10 years, an interesting trend in computing has emerged. General purpose CPUs, such as those provided by Intel, IBM, Sun, AMD and Fujitsu, have increased performance substantially, but nowhere near the increases seen in the late 1980s and early 1990s. To a large extent, single threaded performance increases have tapered off due to the low IPC in general purpose workloads and the ‘power wall’ – the physical limits of power dissipation for integrated circuits (ignoring for the moment exotic techniques such as IBM’s Multi-Chip Modules). The additional millions and billions of transistors afforded by Moore’s Law are simply not very productive for single threaded code, and a great many are used for caches, which at least keeps power consumption at reasonable levels.
At the same time, the GPU – once a special purpose parallel processor – has been able to use ever increasing transistor budgets effectively, geometrically increasing rendering performance over time, since rendering is an inherently parallel application. As the GPU has grown more computationally capable, it has also matured from an assortment of fixed function units into a much more powerful and expressive collection of general purpose computational resources, with some fixed function units on the side. Some of the first signs came with programmable pixel and vertex shaders in the DX8 generation, followed by support for limited floating point arithmetic in DirectX 9 (DX9) GPUs such as ATI’s R300 and NVIDIA’s NV30. The obvious watershed moment was the first generation of DirectX 10 GPUs, which required a unified computational architecture instead of special purpose shader processors that operated on different data types (primarily pixels and vertices). A more subtle turning point (or perhaps a moment of foreshadowing) was when AMD acquired ATI – many people did not quite realize that the motivation was more complicated than simply competing with Intel on a platform level, but in any case, DX10 made everything quite clear.
The first generation of high performance DX10 GPUs – the R600 from ATI and the G80 from NVIDIA – brought the superior power of GPUs, with hundreds of functional units, to a specific set of data parallel problems that previously had been run only on CPUs. The emphasis here is on a specific set of problems, as these initial GPUs were only appropriate for extremely data parallel problems that used array-like data structures and had limited double precision needs. While these GPUs were mostly IEEE compliant for 32 bit floating point math, they lacked the usual denormal handling – flushing denormals to zero instead – and omitted several of the standard rounding modes.
The result is that the computational world is suddenly more complex. Not only are there CPUs of every type and variety, there are also now GPUs for data parallel workloads. Just as the computational power of these products varies, so does the programmability and the range of workloads for which they are suitable. Parallel computing devices such as GPUs, Cell and Niagara tend to be hit or miss – they are all hopeless for any single threaded application and frequently are poor performers for extremely branch-intensive, unpredictable and messy integer code, but for sufficiently parallel problems they outperform the competition by factors of ten or a hundred. Niagara and general purpose CPUs are more flexible, while GPUs are difficult to use with more sophisticated data structures and the Cell processor is downright hostile to programmers.
Ironically, of the two GPU vendors, NVIDIA turned out to have the most comprehensive and consistent approach to general purpose computation – despite the fact that (or perhaps because) ATI was purchased by a CPU company. This article focuses exclusively on the computational aspects of NVIDIA’s GPUs, specifically CUDA and the recently released GT200 GPU, which is used across the GeForce, Tesla and Quadro product lines. We will not delve into the intricacies of the modern 3D pipeline, as represented by DX10 and OpenGL 2.1, except to note that these are alternative programming models that can be mapped to CUDA.