AMD’s overall strategy with their GPU line is to address the ‘sweet spot’ of the market with a single compact GPU, while relying on multi-GPU solutions to scale up further. This stands in sharp contrast to Nvidia, which has aggressively and successfully pursued the largest and highest performance single GPUs. AMD’s measured approach to GPU design is a conscious trade-off that achieves higher yields, lower costs and better margins, at the price of losing the marketing halo of the highest performance single GPU. One other catch is that while graphics is an embarrassingly parallel application, not all games can effectively use multiple GPUs. Thanks to the efforts of application engineers though, most major titles can use more than one GPU.
As shown in Figure 1, Cayman’s system architecture is very similar to that of its predecessor (Cypress), but modestly scaled up. Cayman bumps up the frequency by 30MHz, but more importantly it packs in 24 cores (which AMD calls “SIMDs”), slightly more than the 20 in Cypress. However, the nature of those cores has changed substantially. Each Cayman core is a 16-wide SIMD processor and each SIMD lane is a 4-wide VLIW (or VLIW4), for a total of 64 execution units per core. In contrast, all of AMD’s other DX10+ GPUs used a VLIW5 as the basis for their cores. Each SIMD has a dedicated 8KB L1 texture cache, and all share a 512KB L2 texture cache. These caches are not coherent in the usual sense, as they are read-only.
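The core counts and widths above imply a slightly lower total ALU count for Cayman than Cypress, a quick back-of-the-envelope check using only the figures quoted in the text:

```python
# Back-of-the-envelope ALU totals for Cayman vs. Cypress, using the
# core counts and SIMD/VLIW widths quoted above.
def total_alus(cores, simd_width, vliw_width):
    """Execution units = cores x SIMD lanes x VLIW slots per lane."""
    return cores * simd_width * vliw_width

cayman = total_alus(cores=24, simd_width=16, vliw_width=4)   # VLIW4
cypress = total_alus(cores=20, simd_width=16, vliw_width=5)  # VLIW5

print(cayman)   # 1536 execution units
print(cypress)  # 1600 execution units
```

Despite the extra cores, moving from VLIW5 to VLIW4 leaves Cayman with marginally fewer raw execution units; the payoff is in utilization and scheduling, not peak count.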
Figure 1 – Cayman System Architecture and Comparisons
The L2 cache is actually partitioned into eight 64KB slices, with one slice per memory channel. The memory interface is a total of 256-bits wide and each memory controller drives 2 channels of GDDR5 memory at up to 5.5GT/s, for an aggregate bandwidth of 176GB/s. The memory interfaces are about 12% faster than Cypress, due to several improvements. The GDDR5 training algorithms, which compensate for imperfections in the memory interface, have been tuned. Simultaneously, AMD has tightened board and package specifications to improve signal quality and also improved the physical layer design. The memory controllers support clamshell mode, where two x16 DRAMs are used per channel, rather than a single x32 DRAM. The normal consumer products have 2GB of GDDR5 memory, while any workstation or compute versions will probably support 4GB (using clamshell or higher density DRAMs). The memory interface uses GDDR5’s CRC and retry mechanism to detect and resolve transmission errors.
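The 176GB/s figure falls straight out of the interface width and transfer rate quoted above:

```python
# Aggregate memory bandwidth check for Cayman's 256-bit GDDR5 interface
# running at the 5.5GT/s transfer rate quoted above.
bus_width_bits = 256
transfer_rate_gt_s = 5.5                   # giga-transfers/second per pin
bandwidth_gb_s = bus_width_bits / 8 * transfer_rate_gt_s
print(bandwidth_gb_s)                      # 176.0 GB/s
```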
One of the constraints of the ‘sweet spot’ approach is that AMD is limited to narrower memory interfaces. A 384-bit or 512-bit wide memory interface is certainly feasible for a single GPU. However, it becomes a board design nightmare for dual-GPU graphics cards due to routing congestion between the GPUs and all the DRAMs. A wider interface increases the number of layers needed in the PCB and the number of DRAMs on the board, thereby increasing cost and design effort.
Cayman retains the PCI-Express 2.1 interface to the rest of the PC, as the PCI-E 3.0 specification will not be finalized until the end of 2010. However, the DMA controllers, which actually manage communication with system memory (over PCI-Express), were overhauled. Cypress featured two DMA controllers, each of which handled transfers in a single direction. The Cayman DMA controllers are both bidirectional, which increases the realizable bandwidth over the PCI-Express link.
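A toy model illustrates why bidirectional engines can help; the numbers here are assumptions for illustration (roughly 8GB/s per direction for a PCIe 2.x x16 link, and a hypothetical per-engine sustained rate below the link rate), not measured Cayman figures:

```python
# Toy model (all rates are assumptions, not measured figures) of why
# bidirectional DMA engines raise realizable PCI-Express bandwidth.
# If each engine sustains less than the link's per-direction rate, then
# with direction-locked engines (Cypress-style) a one-way workload can
# use only one engine, while bidirectional engines can team up on it.
LINK_GB_S = 8.0      # assumed PCIe 2.x x16 rate per direction
ENGINE_GB_S = 5.0    # hypothetical per-engine sustained rate

def one_way_throughput(engines_available):
    """Effective rate for a workload moving data in one direction only."""
    return min(engines_available * ENGINE_GB_S, LINK_GB_S)

print(one_way_throughput(1))  # direction-locked: 5.0 GB/s
print(one_way_throughput(2))  # bidirectional pair: 8.0 GB/s (link-limited)
```

Under these assumed rates, a download-only workload goes from engine-limited to link-limited once both engines can service it.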
In comparison, Nvidia has a more complicated clocking architecture across their chips. The Fermi shown above is the C2050 or C2070 model. It has a fast clock of 1.4GHz for the cores (or SMs), while the fixed function hardware runs at the base clock, and the L2 cache and ROPs run at 600MHz. One of the advantages of this multi-clock domain architecture is that Nvidia can scale shader (i.e. SM), memory, or fixed-function graphics performance independently. The downside is more complex clock distribution and buffering between the different clock domains.