NVIDIA’s GT200: Inside a Parallel Processor


Physical Implementation

The vast majority of the GT200 is designed using a standard-cell approach, although the SPs use semi-custom design with manual place and route. The chip boasts 1.4B transistors (compared to 681M for the prior generation), used primarily for combinational logic rather than storage arrays. The GT200 is fabricated in TSMC’s 65nm generic process (65G) and is the largest design taped out at TSMC. The die size has been confirmed as 583.2mm², and NVIDIA acknowledged during the Editor’s Day briefing that the reticle limit prevented them from using 32 SMs in the design. Figure 8 below shows a die micrograph, along with labels for different regions of the chip.
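To put the transistor count and die size in perspective, here is a minimal Python sketch that derives the generational growth and the average transistor density from the figures quoted above. The helper name `transistor_density` is made up for illustration, and the result is a chip-wide average that says nothing about how density varies between logic and arrays.

```python
# Figures quoted in the text above.
GT200_TRANSISTORS = 1.4e9       # 1.4B transistors
PRIOR_GEN_TRANSISTORS = 681e6   # 681M for the prior generation
GT200_DIE_AREA_MM2 = 583.2      # confirmed die size in mm^2


def transistor_density(transistors, area_mm2):
    """Return the average density in millions of transistors per mm^2."""
    return transistors / area_mm2 / 1e6


growth = GT200_TRANSISTORS / PRIOR_GEN_TRANSISTORS
density = transistor_density(GT200_TRANSISTORS, GT200_DIE_AREA_MM2)

print(f"Transistor growth over the prior generation: {growth:.2f}x")
print(f"Average density: {density:.2f}M transistors/mm^2")
```

By this estimate the GT200 carries roughly twice the transistors of its predecessor, at an average of about 2.4M transistors per mm².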

Figure 8 – GT200 Die Micrograph

The GT200 features three major independent clock domains: graphics clock, processor clock and memory clock. The graphics clock is the primary domain and encompasses the texture pipelines, the ROPs and all of the SMs except for the functional units. The processor clock is about twice the frequency of the graphics clock and is used only for the SM functional units. Lastly, the memory clock is used for the GDDR3 memory controllers and the PHYs to the DRAM. All three clock domains run asynchronously, with FIFOs between them to absorb skew, so the graphics clock has no precise relationship to the processor clock, just an approximate ratio.
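The idea of a FIFO decoupling two unrelated clocks can be sketched with a toy event-driven model in Python. This is only an illustration of the concept, not a description of NVIDIA's circuitry: hardware asynchronous FIFOs typically rely on Gray-coded pointers and synchronizer flops, which are not modeled here. The function name `simulate_fifo`, the phase offset, and the 602 MHz and 1296 MHz figures are assumptions chosen to match the "about twice" ratio; they do not come from this section.

```python
from collections import deque


def simulate_fifo(write_mhz, read_mhz, writes, depth):
    """Toy model of a FIFO between two free-running clock domains.

    One item is pushed on every write-clock edge and one item is popped
    on every read-clock edge (if any is available).
    Returns (max_occupancy, overflows).
    """
    write_period = 1.0 / write_mhz  # microseconds per write-clock cycle
    read_period = 1.0 / read_mhz    # microseconds per read-clock cycle
    fifo = deque()
    max_occupancy = 0
    overflows = 0
    writes_done = 0
    next_write = 0.0
    next_read = 0.37 * read_period  # arbitrary phase offset (assumption)

    while writes_done < writes:
        if next_write <= next_read:
            # Write-domain edge: push, or count an overflow if full.
            if len(fifo) < depth:
                fifo.append(writes_done)
            else:
                overflows += 1
            writes_done += 1
            max_occupancy = max(max_occupancy, len(fifo))
            next_write += write_period
        else:
            # Read-domain edge: pop if anything is waiting.
            if fifo:
                fifo.popleft()
            next_read += read_period
    return max_occupancy, overflows


# Slower domain feeding a faster one: the ratio is ~2.15, not exactly 2.
fast_reader = simulate_fifo(602.0, 1296.0, writes=10_000, depth=4)

# Reverse direction with no back-pressure: the FIFO fills and overflows.
slow_reader = simulate_fifo(1296.0, 602.0, writes=10_000, depth=4)

print("602 -> 1296 MHz (max occupancy, overflows):", fast_reader)
print("1296 -> 602 MHz (max occupancy, overflows):", slow_reader)
```

In this model the shallow FIFO never overflows when the reader is faster, even though the two clocks share no exact ratio or phase. The second run shows the limit of buffering: when the writer is persistently faster, the FIFO can only absorb short-term mismatch, so a real design would need flow control.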

One of the areas that NVIDIA put more focus on in this generation was power saving. While this statement seems ironic given that high-end GeForce GTX 280 boards consume 100W more than the previous generation, it is nonetheless true. Fully loaded power consumption increased substantially, but NVIDIA reduced idle power consumption using traditional techniques such as clock gating and voltage/clock scaling, and introduced several power states for partially loaded situations (e.g. when only the dedicated video decoding hardware is active). Idle power went down to 25W (from 45W in the G92), and power consumption while decoding Blu-ray also dropped by about 20W, to 32W.
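The relative savings are easier to judge as percentages. The short Python sketch below works them out from the figures above. The prior Blu-ray decoding figure is not stated directly; it is inferred here as 32W plus the "about 20W" drop, so that percentage is approximate. The helper name `percent_reduction` is made up for illustration.

```python
def percent_reduction(before_w, after_w):
    """Return the percentage drop from before_w to after_w."""
    return 100.0 * (before_w - after_w) / before_w


IDLE_G92_W = 45       # idle power of the G92, from the text
IDLE_GT200_W = 25     # idle power of the GT200, from the text
BLURAY_GT200_W = 32   # Blu-ray decoding power of the GT200, from the text
BLURAY_DROP_W = 20    # "about 20W", so the prior figure below is inferred

bluray_prior_w = BLURAY_GT200_W + BLURAY_DROP_W  # roughly 52W (assumption)

idle_pct = percent_reduction(IDLE_G92_W, IDLE_GT200_W)
bluray_pct = percent_reduction(bluray_prior_w, BLURAY_GT200_W)

print(f"Idle power reduction: {idle_pct:.1f}%")
print(f"Blu-ray decoding power reduction: {bluray_pct:.1f}% (approximate)")
```

On these numbers, idle power fell by a little under half and Blu-ray decoding power by somewhat more than a third.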

Since peak power and the memory interface width increased from the previous generation, it should come as no surprise that the pin count did as well. The GT200 uses a 2236-ball BGA package for power and ground, memory and I/O (allocated in that order), roughly 1.5 times the 1447 balls used in the G80.
