AMD’s Cayman GPU Architecture

Cayman Compute

While the initial products are targeted at the consumer, it is likely that professional products based on Cayman will be released for workstation graphics and compute. It remains to be seen whether such products will use ECC for memory, but it seems plausible. It is difficult to precisely evaluate Cayman as a compute platform because the products released so far are oriented towards gaming. However, it’s possible to make some educated projections based on understanding the microarchitecture and comparing consumer variants. In general, the differences between the consumer versions should mirror the improvements in the compute products.

Figure 6 shows a Cayman and Cypress SIMD, plus a Fermi SM. From purely theoretical numbers, each Cayman SIMD lost 20% single precision FLOP/cycle compared to Cypress and retained the same double precision FLOP/cycle. Accounting for frequency and the extra 4 SIMDs, the SP FLOP/s stayed constant, while DP FLOP/s improved by 24%. In reality though, it is exceptionally difficult to achieve peak performance on the old VLIW5 microarchitecture – most of the time, the 5th issue slot was probably idle. In essence, AMD’s architects sacrificed theoretical performance to improve achievable performance by adding SIMDs. Overall FLOP/s could increase by 15-25%, although if the power management system must be more conservative in professional products those gains will be smaller, perhaps 10-20%.
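The theoretical figures above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only and rests on assumptions that this section does not state: 16 VLIW units per SIMD, one multiply-add (2 FLOP) per lane per cycle in single precision, one double precision FMA per VLIW unit per cycle, 20 SIMDs at 850MHz for Cypress and 24 SIMDs at 880MHz for Cayman (the shipping consumer clocks of the Radeon HD 5870 and HD 6970). The function name `peak_gflops` is made up for this example.

```python
# Peak-throughput arithmetic in GFLOP/s. All hardware parameters below are
# assumptions for illustration; they are not taken from this section.

VLIW_UNITS_PER_SIMD = 16  # assumed: 16 VLIW units in each SIMD
FLOP_PER_FMA = 2          # a fused or chained multiply-add counts as 2 FLOP


def peak_gflops(simds, lanes_per_vliw, clock_ghz):
    """Return (single precision, double precision) peak GFLOP/s."""
    vliw_units = simds * VLIW_UNITS_PER_SIMD
    # Single precision: every lane of every VLIW unit issues one multiply-add.
    sp = vliw_units * lanes_per_vliw * FLOP_PER_FMA * clock_ghz
    # Double precision: assumed one FMA per VLIW unit per cycle on both chips.
    dp = vliw_units * FLOP_PER_FMA * clock_ghz
    return sp, dp


# Assumed consumer configurations (Radeon HD 5870 and HD 6970).
cypress_sp, cypress_dp = peak_gflops(simds=20, lanes_per_vliw=5, clock_ghz=0.850)
cayman_sp, cayman_dp = peak_gflops(simds=24, lanes_per_vliw=4, clock_ghz=0.880)

print(f"Per-SIMD SP change: {4 / 5 - 1:+.0%}")
print(f"SP FLOP/s change:   {cayman_sp / cypress_sp - 1:+.1%}")
print(f"DP FLOP/s change:   {cayman_dp / cypress_dp - 1:+.1%}")
```

Under these assumptions the per-SIMD single precision rate drops by a fifth, total single precision throughput is essentially flat, and double precision throughput rises by roughly 24%, which matches the figures quoted above. None of this captures achievable performance, which depends on how often the VLIW slots are actually filled.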


Figure 6 – Cayman, Cypress SIMD and Fermi SM

Analyzing Cayman’s memory hierarchy is simpler, since there are fewer changes with respect to Cypress. The SIMD memory pipelines (i.e. register files, L1 texture caches and LDS) are modestly more efficient, due to better coalescing. The big improvement though is the extra SIMDs. The L2 cache and ROPs will retain similar bandwidth to Cypress. Cayman’s memory controller has 15% higher bandwidth than Cypress, which should also be true for compute products.

Compute and workstation products need to be more robust than a gaming graphics card, which means slightly lower core and memory frequencies. Based on historical data, it seems plausible that a compute-oriented Cayman might hit 800-850MHz core clock and 5.0-5.3GT/s for GDDR5. Considering both FLOP/s and bandwidth, Cayman’s overall compute performance should be roughly 10-20% higher than Cypress. However, the TDP is very challenging to predict. Professional products tend to have substantially more DRAMs, which increase power consumption noticeably. A GDDR5 DRAM could easily be 2-3W, so a reasonable projection for a 4GB Cayman compute product is ~275W. Considering efficiency, Cayman should be similar to Cypress in performance per watt and 10-15% ahead in performance per mm² of silicon. Overall, Cayman’s area efficiency will be stronger than the power efficiency.

Comparing the compute performance for a Fermi-based Tesla and a hypothetical Cayman-based FireStream is much trickier because they are fundamentally different architectures. Moreover, there is a serious lack of anything resembling good benchmarks or performance data for compute workloads, period. Based on the architectural analysis, some conclusions are possible. In general, Cayman will be attractive for highly regular, single precision workloads, while Fermi (or a CPU) will be the best choice for most other applications.

The projected Cayman has 17% higher memory bandwidth, roughly 2.5X the raw single precision FLOP/s and 26% higher raw double precision FLOP/s than the Tesla C2070. However, AMD’s VLIW microarchitecture is inherently less efficient, and the memory hierarchy is also incredibly sensitive to the workload. For single precision applications that are primarily regular computation and regular memory access patterns, Cayman should have good utilization within each VLIW4 bundle and offer incredibly attractive performance. Even in the case of a bandwidth bound application, Cayman will be on par with or slightly ahead of the Tesla. For double precision though, Fermi is likely to be the higher performance option. The two GPUs have similar raw performance, but the flexibility of Nvidia’s architecture is likely to yield higher delivered performance. Modestly complicated workloads (regardless of precision) will also heavily favor Nvidia’s architecture. For instance, the caches and scalar vector lanes in Fermi can more effectively tolerate complex memory access patterns or code with little ILP. Of course, the most complicated algorithms (e.g. lots of communication or highly irregular control flow) are still best suited to a CPU, rather than a GPU.
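As a rough cross-check of the raw ratios against the Tesla C2070, the sketch below evaluates the projected clock range (800-850MHz core, 5.0-5.3GT/s memory). It is not a definitive model. The assumptions are a 256-bit memory interface and 24 SIMDs of 16 VLIW4 units for Cayman, and published peak figures for the C2070 of roughly 1030 GFLOP/s single precision, 515 GFLOP/s double precision and 144GB/s. The helper `cayman_projection` is a hypothetical name used only here.

```python
# Raw peak ratios of a projected compute Cayman over a Tesla C2070.
# Every hardware figure below is an assumption for illustration.

TESLA_C2070 = {"sp_gflops": 1030.0, "dp_gflops": 515.0, "bw_gbs": 144.0}

CAYMAN_VLIW_UNITS = 24 * 16  # assumed: 24 SIMDs x 16 VLIW4 units
CAYMAN_BUS_BITS = 256        # assumed: 256-bit GDDR5 interface


def cayman_projection(core_ghz, mem_gtps):
    """Peak figures for a hypothetical compute Cayman at the given clocks."""
    return {
        # 4 lanes per VLIW4 unit, 2 FLOP per multiply-add
        "sp_gflops": CAYMAN_VLIW_UNITS * 4 * 2 * core_ghz,
        # assumed one double precision FMA per VLIW4 unit per cycle
        "dp_gflops": CAYMAN_VLIW_UNITS * 2 * core_ghz,
        # transfers per second times bus width in bytes
        "bw_gbs": mem_gtps * CAYMAN_BUS_BITS / 8,
    }


low = cayman_projection(core_ghz=0.800, mem_gtps=5.0)
high = cayman_projection(core_ghz=0.850, mem_gtps=5.3)

for key in ("bw_gbs", "sp_gflops", "dp_gflops"):
    lo_ratio = low[key] / TESLA_C2070[key]
    hi_ratio = high[key] / TESLA_C2070[key]
    print(f"{key:10s} Cayman/Tesla: {lo_ratio:.2f}x to {hi_ratio:.2f}x")
```

With these inputs, the ratios quoted above (17% more bandwidth, about 2.5X single precision, 26% more double precision) fall near the upper end of the computed range. These are peak numbers only and say nothing about utilization, which is where the two architectures differ most.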

Conclusions

Ultimately, the challenge for AMD’s compute hardware is not performance, or even efficiency. The challenge is the software: the ecosystem for OpenCL and DirectCompute on AMD GPUs. Both standards are relatively new and do not have large codebases or extensive infrastructure. The widespread adoption of AMD’s hardware will be directly related to the maturity of the compilers and developer tools. Because AMD relies on a VLIW microarchitecture, the utilization of the hardware (and thus performance) is heavily influenced by the compiler. Cayman should be a much easier compiler target than the previous generations, once AMD’s compiler team has worked through the first part of the learning curve. More importantly, AMD is still working to build a viable ecosystem around OpenCL and DirectCompute. For example, one of the first applications of DirectCompute on AMD GPUs is a Morphological Anti-Aliasing algorithm, implemented via compute shaders. This was done internally at AMD, but the goal is really to enable third party developers, and a solid implementation of DirectCompute (or OpenCL) is just the start. As discussed in our OpenCL article, the two standards are about 2 years behind CUDA. Historically, AMD has under-invested in their software ecosystem, although this is beginning to change now that the overall direction for compute APIs is clear.

The bottom line for compute is that Cayman will be the first GPU from AMD that has any OpenCL support close to launch. It represents the beginning of AMD’s journey down the path of general purpose computing, although there is a long road ahead. The first Cayman compute products will likely be attractive for very specific workloads and get some initial customers. However, AMD’s long term goal must be to enhance their architecture and software ecosystem so that a broad swath of applications can easily take advantage of their hardware. In the meantime, AMD has their hands on a solid graphics product that will serve them well.