Nvidia’s system architecture continues to focus on maximizing the performance of a monolithic GPU, with a die that nearly fills the reticle of TSMC’s lithography systems. This is driven by three factors. First, from the compute standpoint, programming two GPUs in tandem is extremely inconvenient, so maximizing the performance of a single GPU is essential for programmers. Second, the competition for high performance computing (HPC) systems is not a single CPU, but rather a dual socket HPC oriented system with 4-6 cores per socket. Third, building massive GPUs enables Nvidia to retain the crown of highest single GPU performance, which is perceived to have a ‘halo’ effect: marketing value that extends across the whole product line (even low end cards). The downside, of course, is that large chips are not cheap to manufacture, as yields drop non-linearly with die size. Figure 1 below shows the system architecture for Fermi, GT200 and G80. Despite the labels, G80 only has texture caches, just like GT200.
Figure 1 – Fermi and GT200 System Architecture
Fermi’s system architecture is much cleaner and more elegant than the current generation, where three cores share a memory pipeline. Fermi is a 16 core device, but each core (or SM in Nvidia parlance) has grown substantially, with 32 functional units and a semi-coherent L1 cache that can be configured as either 16KB or 48KB. Unlike the caches in current and prior generations, these are indeed semi-coherent, although the details will be touched upon later in the appropriate section. The L1 caches are tied together with a shared 768KB L2 cache, which is the memory agent and can assist with some synchronization.
Fermi has six GDDR5 memory controllers, each operating two channels of memory. The memory interface is 384 bits wide and can also be configured to access DDR3 memory. Since Nvidia is not disclosing any product or operational characteristics of Fermi, we must make some educated guesses regarding memory bandwidth. At a GDDR5 data rate of 3.6Gbps per pin, the interface would yield 172.8GB/s of memory bandwidth, barely edging out AMD’s latest offering. A DDR3 memory interface operating at the standard 1.6Gbps rate would yield just 76.8GB/s, while a 1.575 volt, 2Gbps interface would achieve 96GB/s. One thing to keep in mind, though, is that compute oriented GPUs typically operate their memory interfaces at lower frequencies: Tesla peaked at 1.6GT/s GDDR3 versus 2.2GT/s for consumer oriented products, likely for stability and capacity reasons.
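The bandwidth estimates above follow from simple arithmetic: bus width in bits times the per-pin data rate, divided by eight to convert bits to bytes. The short sketch below reproduces the three figures; the function name and structure are illustrative only, and the data rates are the speculative values from the text, not disclosed specifications.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s.

    Every pin of the interface moves data_rate_gbps gigabits per second,
    so the total is width * rate bits, divided by 8 to get bytes.
    """
    return bus_width_bits * data_rate_gbps / 8


BUS_WIDTH_BITS = 384  # Fermi's memory interface width, from the text

# Speculative per-pin data rates discussed in the text (assumptions, not specs).
scenarios = [
    ("GDDR5 at 3.6 Gbps", 3.6),
    ("DDR3 at 1.6 Gbps", 1.6),
    ("DDR3 at 2.0 Gbps (1.575 V)", 2.0),
]

for label, rate in scenarios:
    bandwidth = peak_bandwidth_gb_s(BUS_WIDTH_BITS, rate)
    print(f"{label}: {bandwidth:.1f} GB/s")
```

These are theoretical peaks; sustained bandwidth on real workloads will be lower.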
Using x16 DRAMs in a single rank, Fermi can access 24 devices (a 384-bit interface divided by 16 bits per device). In theory, this means that Fermi will support up to 6GB with 2Gbit DRAMs and 12GB with 4Gbit devices. At the moment though, there are no such DRAMs in production, only 1Gbit offerings from Qimonda, Samsung and Hynix. Qimonda has publicly announced that it will ramp 2Gbit devices some time in 2010 though, and has expressed interest in 4Gbit DRAMs further out on its roadmap. Another way to increase capacity would be to use two ranks of DRAMs, although this would likely reduce the operating frequency.
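The capacity figures can be checked the same way. The sketch below derives the device count from the interface width and then the total capacity for each DRAM density; the helper names are hypothetical, and the two-rank case simply doubles the device count without modeling the likely frequency penalty.

```python
def device_count(bus_width_bits: int, device_width_bits: int, ranks: int = 1) -> int:
    """Number of DRAM devices needed to populate the bus for the given rank count."""
    return (bus_width_bits // device_width_bits) * ranks


def capacity_gb(devices: int, density_gbit: int) -> float:
    """Total capacity in GB: devices * density in gigabits, divided by 8 bits per byte."""
    return devices * density_gbit / 8


BUS_WIDTH_BITS = 384    # Fermi's memory interface width, from the text
DEVICE_WIDTH_BITS = 16  # x16 DRAMs, from the text

for ranks in (1, 2):
    devices = device_count(BUS_WIDTH_BITS, DEVICE_WIDTH_BITS, ranks)
    for density in (1, 2, 4):  # DRAM density in Gbit
        print(f"{ranks} rank(s), {density} Gbit DRAMs: "
              f"{devices} devices, {capacity_gb(devices, density):.0f} GB")
```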
The external interface to the rest of the system is unchanged, still relying on PCI-Express gen 2, but the controller logic has improved. Fermi can now simultaneously transfer data to and from the host, which enables computation on the GPU to be better pipelined with respect to the rest of the system.
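To illustrate why simultaneous transfers in both directions matter, consider a workload split into chunks, each of which is uploaded, processed, and downloaded. The sketch below is a deliberately idealized pipeline model, not a description of Nvidia's controller: the function name and all timings are hypothetical, and it assumes equal-sized chunks with copies that overlap perfectly with computation.

```python
def pipelined_time_ms(chunks: int, h2d_ms: int, kernel_ms: int, d2h_ms: int,
                      duplex: bool) -> int:
    """Idealized total time for a chunked upload -> compute -> download pipeline.

    The first chunk pays the full latency of all three stages. Every later
    chunk costs only the slowest stage. Without bidirectional transfers the
    upload and download share one copy path, so their times add together.
    """
    fill = h2d_ms + kernel_ms + d2h_ms
    copy_stage = max(h2d_ms, d2h_ms) if duplex else h2d_ms + d2h_ms
    bottleneck = max(copy_stage, kernel_ms)
    return fill + (chunks - 1) * bottleneck


# Hypothetical timings per chunk, chosen only for illustration.
CHUNKS, H2D_MS, KERNEL_MS, D2H_MS = 10, 4, 5, 4

one_way = pipelined_time_ms(CHUNKS, H2D_MS, KERNEL_MS, D2H_MS, duplex=False)
two_way = pipelined_time_ms(CHUNKS, H2D_MS, KERNEL_MS, D2H_MS, duplex=True)
print(f"one direction at a time: {one_way} ms")
print(f"simultaneous both ways:  {two_way} ms")
```

In this model the benefit appears only when the combined copy time exceeds the compute time per chunk; a compute bound workload sees no change.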