NVIDIA’s GT200: Inside a Parallel Processor


System Architecture

One of the most substantial differences between CPUs and GPUs is that a GPU is fundamentally a complete system, designed in tandem with a very specific memory sub-system and offering almost no flexibility in terms of configuration. In contrast, a CPU’s memory sub-system is designed for extreme post-manufacturing flexibility, so that users can expand and upgrade their memory capacity after purchase. This means that CPU memory systems must use DIMMs, which have poorer electrical characteristics, and therefore lower performance, than DRAMs mounted directly on the PCB, as is done on graphics cards.

The GT200 is designed as a monolithic GPU that starts at the extreme high end of the market and will eventually cascade down across all product lines with successive compactions, just like the G80. This is in direct contrast to ATI’s strategy, which is more focused on the volume performance segment. ATI’s RV770 addresses the performance market, but requires two dice in a card for the highest performance.

There are definite trade-offs to each approach. Using a single monolithic die should lead to a performance advantage for graphics, but with lower yields and higher unit costs. Additionally, a very large die GPU (such as the GT200) cannot span as much of the market as a smaller die, which can be sold on its own for the performance segment or doubled up on a dual-die card for the high end. Ultimately for graphics, the question of monolithic integration versus dual-die packaging is fairly ambiguous; there are advantages to each and it really depends on the implementation.

For general purpose computing, the answer is much more clear cut: a single monolithic GPU is far more useful than two smaller GPUs packaged together. In the world of CPUs, multi-processing is a relatively small change to the programming model. CPUs already have coherent caches (coherent with respect to I/O), so making CPUs cache coherent with each other is a small step, and in the world of x86, multi-processors have been common since the P6. In contrast, GPUs eschew the overhead of cache coherency even within a single chip. Since there is no coherency on a single GPU, there is certainly none between multiple GPUs. This means there is no way to use multiple GPUs for a general purpose application unless the developer is willing to manually manage the sharing of data and the communication; and while CUDA is an excellent programming model, it does little to help developers here. Undoubtedly, NVIDIA’s interest in the GPU as a computational device was a key motivator for pursuing a monolithic design, and is a tangible demonstration of how important compute oriented GPUs are to the company.
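To make the point concrete, here is a minimal sketch (not NVIDIA sample code; the kernel and wrapper function names are invented for illustration) of what sharing a single buffer between two GPUs looks like when the developer must manage it by hand. In the CUDA of this era each GPU would normally be driven by its own host thread; that detail is compressed into one thread here for brevity, and error checking is omitted.

    #include <cuda_runtime.h>

    __global__ void step_kernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] += 1.0f;    // placeholder for the real per-GPU work
    }

    void run_on_two_gpus(float *host_buf, int n)
    {
        size_t bytes = n * sizeof(float);
        int blocks = (n + 255) / 256;
        float *d0, *d1;

        // GPU 0: its own allocation, its own copy of the data, its own launch.
        cudaSetDevice(0);
        cudaMalloc((void **)&d0, bytes);
        cudaMemcpy(d0, host_buf, bytes, cudaMemcpyHostToDevice);
        step_kernel<<<blocks, 256>>>(d0, n);

        // No coherency between GPUs: the only way to hand the result to
        // GPU 1 is an explicit round trip through host memory.
        cudaMemcpy(host_buf, d0, bytes, cudaMemcpyDeviceToHost);

        cudaSetDevice(1);
        cudaMalloc((void **)&d1, bytes);
        cudaMemcpy(d1, host_buf, bytes, cudaMemcpyHostToDevice);
        step_kernel<<<blocks, 256>>>(d1, n);
        cudaMemcpy(host_buf, d1, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d1);
        cudaSetDevice(0);
        cudaFree(d0);
    }

On a pair of cache coherent CPUs, by comparison, the second processor would simply read the same data structure; every copy above exists purely because the hardware provides no such guarantee.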


Figure 3 – G80, GT200 and Niagara II System Architecture

Figure 3 above shows the system architecture of three throughput oriented processors: the G80, the GT200 and Niagara II. Note that the caches in the two GPUs are read-only texture caches, rather than the fully coherent caches in Niagara II. The GT200 frame buffer memory interface is 512 bits wide, composed of eight 64-bit GDDR3 memory controllers, compared to a 384-bit wide interface on the previous generation. The memory bandwidth varies across different models, but peaks at 141.7GB/s when the memory controllers and memory are running at 1107MHz, approximately 65% higher than the previous generation. On top of a wider and higher bandwidth memory interface, the GDDR3 memory controllers coalesce a much greater variety of memory access patterns, improving efficiency as well as peak performance. The frame buffer is populated with 1GB of memory for performance sensitive consumer applications (i.e. games), but for professional applications it can be expanded to 32 DRAMs (4GB) at reduced bandwidth (800MHz operation). As with most GPUs, there is no ECC for the memory, which pretty much precludes any possibility of large scale GPU clusters due to reliability concerns.
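As a quick sanity check on the peak figure: GDDR3 transfers data on both clock edges, so a 512 bit (64 byte) interface at 1107MHz yields 64B x 1107MHz x 2 ≈ 141.7GB/s, while the previous generation’s 384 bit interface at 900MHz works out to 86.4GB/s, which is where the roughly 65% improvement comes from.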

The external interface from the GT200 to the host system was also upgraded, doubling bandwidth by moving to a PCI-Express Gen 2 x16 slot with a theoretical 8GB/s of bandwidth in each direction; when considering the PCI-E packet overhead the effective maximum is roughly 5-6GB/s.
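For reference, the 8GB/s figure is simply the link math: each Gen 2 lane signals at 5GT/s and, after 8b/10b encoding, delivers 500MB/s per direction, so 16 lanes provide 8GB/s each way; the further drop to 5-6GB/s reflects packet headers and other protocol overhead rather than the physical signaling.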

