NVIDIA’s GT200: Inside a Parallel Processor



At the time of writing, several GT200-based products are available. They are all concentrated at the high end of the market and span both professional and consumer applications. Given the enormous die size and power consumption of the GT200, mainstream products in the $150 range will have to wait until either a shrink to 55nm or a variant with some of the TPCs removed (or more likely, both).

Table 1 – Comparison of GT200-based Products

The professional products use dual rank DRAMs to increase capacity (32 DRAMs), at the cost of ~20% lower bandwidth. The professional products are also able to achieve higher frequencies, by virtue of much more expensive cooling solutions (particularly the rackmounted Tesla S1070). Table 1 above describes the four different products. While it seems odd that the S1070 has lower power consumption than the GTX 280, there are several contributing factors. First, the S1070 and C1060 do not use any of the fixed function graphics hardware in the TMUs or ROPs. This includes the blending and interpolation hardware, most of the special function unit and quite a few other portions of the chip. These are all clock gated off, which reduces power consumption. Second, the cooling solutions for the professional systems are much more robust, which lowers the junction temperature in the GPU and in turn lowers the leakage and thermal dissipation.


Graphics processors have rapidly matured over the last several years, leaving behind their roots as fixed function accelerators and growing into general purpose computational devices for highly parallel workloads (as exemplified by graphics). Some of the earliest discussions of GPUs as computational devices date back to SIGGRAPH 2004 and reference even earlier academic work at Stanford (Brook) and Waterloo (Sh). NVIDIA has been the most consistent and aggressive in pursuing this vision, with both software and hardware.

In 2007, NVIDIA launched CUDA, which is the most mature of the emerging programming models and toolchains for GPUs. There are quite a few other efforts in the same field, some pre-dating CUDA and some created in response to it. ATI was one of the first to advocate GPGPU (working with Stanford on their Folding@home client), but their efforts were hampered by the ATI/AMD acquisition and a lack of key staff to take GPGPU from concept to reality in a unified manner. That has since been rectified, and folks at ATI are hard at work on CAL, Brook+ and OpenCL. Intel is promoting the ultimate in programmability for GPUs – inheriting all the advantages (and disadvantages) of x86 compatibility and also supporting OpenCL. While this is certainly alluring and all trends point towards full programmability in C/C++ as the end game, Intel will have no discrete GPUs until 4Q09 at the earliest. Apple began working on what would become OpenCL to avoid being tied to any specific hardware vendor, although that is still pre-release and under the auspices of the Khronos Group. While CUDA is not an open industry standard and does not work with ATI or Intel GPUs, it is readily available and far more mature and programmer friendly than the alternatives. For the time being, CUDA is the only game in town for parallel computing on GPUs.

The hardware side of the equation is equally important. NVIDIA’s GT200 is an incredibly aggressive derivative of the G80 architecture, maximizing the die area and power consumption to provide substantial improvements in almost every aspect of the architecture. The improvements are numerous and pervasive, increasing the overall performance and adding more features for programmability to further NVIDIA’s vision of GPGPU. Many of these changes are obvious, such as the registers per SM, the number of SMs, or the ratio of SMs to memory pipelines, but some of the more subtle ones deserve a brief mention. Of particular importance are the much more powerful and flexible memory access coalescing rules in the GT200, although the shared memory atomic instructions, warp voting functions and 64-bit integer and floating point support are also worth noting.
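As a rough illustration of why more flexible coalescing matters, the Python sketch below counts memory transactions for one half-warp of 16 threads under two simplified models. The rules encoded here are assumptions based on NVIDIA's CUDA documentation for the two generations, not on anything measured for this article, and they ignore details such as inactive threads and transaction size reduction. The function names are hypothetical and chosen only for this example.

```python
HALF_WARP = 16   # threads whose accesses are considered together in this model
WORD_BYTES = 4   # each thread reads one 32-bit word


def strict_transactions(addresses):
    """Assumed G80-style rule: a single transaction only when thread k reads
    word k of a 64-byte aligned block; otherwise one transaction per thread."""
    base = addresses[0]
    aligned = base % (HALF_WARP * WORD_BYTES) == 0
    in_order = all(a == base + k * WORD_BYTES for k, a in enumerate(addresses))
    return 1 if aligned and in_order else len(addresses)


def relaxed_transactions(addresses, segment_bytes=128):
    """Assumed GT200-style rule: one transaction per distinct aligned
    segment touched, regardless of which thread touches which word."""
    return len({a // segment_bytes for a in addresses})


def half_warp(base, stride_words=1):
    """Byte addresses read by 16 threads starting at `base` with a fixed stride."""
    return [base + k * stride_words * WORD_BYTES for k in range(HALF_WARP)]


if __name__ == "__main__":
    patterns = {
        "aligned, sequential": half_warp(0),
        "misaligned by one word": half_warp(4),
        "reversed order": list(reversed(half_warp(0))),
        "stride of 16 words": half_warp(0, 16),
    }
    for name, addrs in patterns.items():
        print(name, strict_transactions(addrs), relaxed_transactions(addrs))
```

Under these assumptions, a misaligned or permuted access pattern costs 16 transactions in the strict model but only 1 in the relaxed model, while a stride of 16 words still needs 8 transactions in the relaxed model because the accesses fall in 8 different segments.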

The bottom line is that the GT200 takes several important steps forward in terms of programmability and cements NVIDIA’s status as the leader in general purpose computation on GPUs for the next year or so. While these improvements certainly contributed to the GT200’s incredible power consumption and die area, it’s clear that the long-term trend favors trading power and area for programmability, and that NVIDIA will be pushing further in that direction in future generations.


As many are aware, this was my first article to examine a GPU. Starting the article was a little like landing in a foreign country – nothing is exactly familiar, the language is a little different, etc. This article wouldn’t have been possible without the insights and comments from quite a few folks familiar with GPUs and other parallel architectures:

  • The GT200 team, NVIDIA
  • James Wang, NVIDIA
  • John Nickolls, NVIDIA
  • Tim Murray, NVIDIA
  • Andrew Humber, NVIDIA
  • Bryan Catanzaro, UC Berkeley
  • The presenters at SIGGRAPH’s Beyond Programmable Shading session
  • The PR team from NVIDIA’s GT200 editor’s day – Bryan Del Rizzo, Sean Cleveland, Brian Burke, Derek Perez and Rick Allen
  • Rys, Uttar and the others from B3D
