AMD’s Cayman GPU Architecture

Physical Implementation

Cayman is a 2.64B transistor design, manufactured on TSMC’s high performance 40nm process. Cayman’s die size has increased in tandem with the transistor count; it weighs in at 389mm2, compared to the 336mm2 Cypress, while Fermi is estimated to be ~550mm2. There is a very clear trend that AMD GPUs have grown in die size, at ~20% per generation for the last 2-3 years. The motivations are twofold. First, AMD must maintain a competitive product portfolio against Nvidia’s much larger high-end GPUs. Second, increasing die area is generally a more power efficient way to improve performance for graphics, as opposed to increasing frequency. The most power efficient way to improve performance is via fixed function hardware, but the trend for GPUs is towards generalized programmability and AMD is already much heavier on dedicated hardware than Nvidia (although less so than Intel’s integrated GPUs).

Power consumption has been a gating factor in CPU design since the 90nm node, with careful optimization of both dynamic and static power consumption. Discrete GPUs are much less power constrained. A commodity x86 CPU might draw about 130W – a limit imposed by the cost of commodity heat sink and fan combinations. High-end GPUs have steadily been increasing power consumption and currently reach in the neighborhood of 300W. However, the power consumption cannot increase forever, and 300W appears to be a limit for most GPUs – even considering the exotic cooling that is already used. GPUs recently started bumping into thermal limits, while CPUs hit them around 2004. As a result, discrete GPUs are several years behind CPUs in the latest power saving techniques, but they are now starting to explore similar techniques to grapple with the limit on power and thermal headroom.

Previous AMD GPUs relied on power states that are controlled by the graphics drivers and the hardware. The driver is configured with a number of profiles that correspond to different usage models (e.g. multi vs. single monitor display, 3D graphics vs. 2D, hardware decoding). Each profile can have 3 different power modes (low, medium and high). In turn, each of these modes can run at a different voltage and frequency (although not all GPUs take advantage of this). The driver contains all the configuration details for each mode, but the hardware is actually responsible for determining whether to run at high, medium or low power.
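The split of responsibilities – driver supplies the tables, hardware picks the mode – can be sketched as a simple lookup. All profile names and voltage/frequency numbers below are hypothetical illustrations, not AMD's actual driver tables:

```python
# Hypothetical driver power tables: each usage profile maps to three power
# modes, each with its own frequency/voltage pair. The driver only writes
# the table; the hardware decides at runtime which mode to run in.
PROFILES = {
    "3d_single_monitor": {
        "low":    {"mhz": 250, "volts": 0.90},
        "medium": {"mhz": 500, "volts": 1.05},
        "high":   {"mhz": 880, "volts": 1.17},
    },
    "video_decode": {
        "low":    {"mhz": 250, "volts": 0.90},
        "medium": {"mhz": 400, "volts": 1.00},
        "high":   {"mhz": 400, "volts": 1.00},  # decode never needs full clocks
    },
}

def select_state(profile: str, hw_power_mode: str) -> dict:
    """Return the frequency/voltage pair the hardware-chosen mode runs at."""
    return PROFILES[profile][hw_power_mode]
```

Note that a profile need not use three distinct operating points – as the article observes, not all GPUs take advantage of per-mode voltage and frequency.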

Cayman enhances the existing system so that the hardware power controller can dynamically adjust the frequency within a given power mode. The power management system is a simpler version of the one described at ISSCC for AMD’s 32nm Llano processor. The power controller calculates power draw by estimating the chip wide capacitance based on sampling a large number of performance counters over time. Based on the estimated power draw, the GPU will periodically adjust the frequency (but not the voltage) to stay within pre-set limits on power draw and thermal dissipation.
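The control loop can be sketched roughly as follows. This is a hypothetical simplification of the mechanism described above, with made-up counter weights and clock limits; the key properties it illustrates are that power is estimated from activity counters (a proxy for switched capacitance) rather than measured, and that only frequency – not voltage – is adjusted:

```python
# Counter-based power limiter sketch (all constants hypothetical).
def estimate_power_watts(counters: dict, freq_mhz: float, volts: float,
                         weights: dict, static_w: float) -> float:
    # Dynamic power ~ C_eff * V^2 * f, where the effective switched
    # capacitance C_eff is inferred from sampled activity counters.
    c_eff = sum(weights[name] * count for name, count in counters.items())
    return static_w + c_eff * volts ** 2 * (freq_mhz * 1e6)

def next_frequency(freq_mhz: float, estimated_w: float, limit_w: float,
                   f_min: float = 500.0, f_max: float = 880.0,
                   step: float = 10.0) -> float:
    """Each control interval: nudge the clock down when over the power
    limit, and back up toward the rated clock when under it."""
    if estimated_w > limit_w:
        return max(f_min, freq_mhz - step)
    return min(f_max, freq_mhz + step)
```

A real controller would also integrate thermal limits and filter the estimate over time, but the throttle-down/recover behavior is the essence of staying within a pre-set power envelope without touching voltage.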

The power and thermal limits are software configurable – both through the BIOS and through AMD’s OverDrive Utility. However, using the latter to run at higher thermal levels will likely void the warranty. Since the GPU will adjust to stay within the limits, frequency guard banding can be relaxed and clock speed increased. Specifically, AMD does not need to bin GPUs based upon worst case workloads, and can instead rely on throttling to correctly handle power viruses and other exceptional workloads. Because the power controller is software transparent and doesn’t rely on recognizing applications in the driver, it also scales to new applications. The benefits are very application dependent, but AMD indicated that typical gains are in the 10-20% range for frequency.

In setting the limits, AMD had to determine which workloads would run at full frequency, and which ones would end up throttling to a lower clock speed. To maintain credibility for the rated frequency and avoid a fiasco like the Cyrix “Pentium Rating”, AMD needs to ensure that most real applications do not throttle below the rated clock frequency under normal conditions. The crux of the dilemma is the definition of ‘most real applications’. Does ‘most’ mean 66%, 80%, 95% or something else? While customers should ultimately care about performance, frequency is a key determinant and an aspect that is actively advertised and widely understood. AMD indicated that their goal is to avoid throttling any real world applications, although the limits may vary by product family.

The power management has several applications beyond simply raising the base clock. OEMs or end-users can configure cards to run quietly, by capping the power dissipation at a low level. In general, passive cooling is sufficient for graphics cards that dissipate ~130W, which means about 100W for the GPU itself and the rest for GDDR, VRMs and other components. OEMs can also produce specialized graphics cards that will fit within different power requirements – for instance tailored to certain notebook configurations.

Cayman uses a tremendous amount of on-chip storage, which has a critical impact on performance, power and area. Each of the 24 SIMDs has a 256KB register file, 32KB LDS and 8KB L1 texture cache. Shared across the chip are the 512KB L2 texture cache, 64KB GDS, 32KB write combining cache and 128KB for the read/write (or color) cache. Altogether that is a total of 7840KB of data storage arrays – and this isn’t even counting the arrays used for instruction caching. The storage arrays are all designed using a custom memory compiler targeted at TSMC’s 40nm process and do not implement ECC. The cost of ECC is relatively high in terms of power and area, and more importantly, for graphics workloads, errors are determined by visual acuity, rather than bit for bit accuracy. Eschewing ECC for SRAM is an example of how AMD has balanced the competing needs of the graphics and compute markets – and focused on low hanging fruit.
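As a sanity check, the 7840KB figure follows directly from the sizes quoted above (all in KB):

```python
# On-chip data storage totals for Cayman, per the figures in the text.
per_simd = 256 + 32 + 8          # register file + LDS + L1 texture cache
shared = 512 + 64 + 32 + 128     # L2 texture, GDS, write combining, color cache
total_kb = 24 * per_simd + shared
print(total_kb)  # 7840
```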

AMD has not announced any Cayman products using ECC for the external memory. However, the memory controller is capable of using regular GDDR5 DRAMs to emulate ECC – similar to the approach Nvidia has taken with Fermi. Ordinarily with an ECC memory interface, 9 DRAMs are used instead of 8 – both for the extra storage capacity and extra bandwidth. Emulating ECC in a GPU will sacrifice memory capacity and memory bandwidth (to accommodate the extra storage and data transfer needed for syndrome bits). Moreover, it’s not clear that ECC would substantially improve the overall reliability of AMD GPUs, as opposed to fulfilling a marketing checklist. SRAM in advanced process technologies is much more susceptible to soft errors than DRAM, and AMD GPUs do not use any protection for on-chip storage. Parity is sufficient to detect soft errors in read-only caches, which can reload good data from memory. However, memories for modified data, including the register files, write combining caches and atomic caches all need some form of protection – whether it is ECC, highly stable storage cells or something else.
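The 9-DRAMs-for-8 overhead falls out of the standard SECDED (single-error-correct, double-error-detect) Hamming bound. The sketch below is a generic textbook construction – not necessarily the exact code any particular GPU memory controller uses – showing why a 64-bit word needs 8 check bits, i.e. one extra device per eight:

```python
# Check-bit count for a SECDED Hamming code protecting data_bits of data.
def secded_check_bits(data_bits: int) -> int:
    # Single-error correction needs r check bits with 2**r >= data + r + 1;
    # double-error detection adds one overall parity bit on top.
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1

print(secded_check_bits(64))   # 8 check bits per 64-bit word
print(8 / (64 + 8))            # ~0.111 of the footprint, i.e. 1 DRAM in 9
```

When those 8 check bits are carved out of regular GDDR5 devices rather than stored in a ninth DRAM, that ~11% of capacity and the transfers needed to fetch the syndrome bits come straight out of the usable memory and bandwidth – which is the sacrifice described above.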
