Introduction
In the last several years, graphics processors (GPUs) have become an increasingly important element of modern computers. The first GPUs were highly specialized and very limited off-load engines that achieved much higher performance rendering 3D graphics than a software renderer running on the general purpose CPU. That performance translated into better graphics and a better gaming experience, and the competition was, and remains, fierce. Since 3D graphics is an inherently parallel workload, GPUs have been able to scale up performance as Moore’s Law grants architects a greater number of transistors. GPUs have evolved specifically for graphics and graphics APIs such as DirectX and OpenGL, and the architectures take advantage of wide vectors, many threads and tens of cores to exploit the parallelism between pixels, vertices and other parts of the 3D pipeline.
In the last 5 years, GPU hardware has also become increasingly programmable, starting with the DX10 generation. Recent designs resemble throughput oriented processors, rather than the application specific hardware common in the 1990s. As always though, software has taken more time to catch up. For example, OpenCL – one of the two industry standard APIs for executing general purpose, throughput oriented workloads on GPUs – has only recently become widely supported, and it will take until the latter part of this year for Intel’s integrated graphics to add support. The interest in so-called GPU computing is largely driven by performance and efficiency. For the right workloads, GPUs have higher peak performance than CPUs, partially due to the larger die area and substantially higher power budgets. As our previous analysis has shown, theoretical GPU performance efficiency can be 2-4X that of CPUs (measured by performance/watt and performance/mm2). Of course, for poorly suited workloads, such as those with complex control flow, data flow or latency sensitivity, the performance and efficiency are absolutely awful.
The trend is clear – GPUs are becoming an increasingly important part of modern systems. As such, understanding the performance of GPUs is equally critical. However, modern GPUs are incredibly complex architectures, and their performance is equally complicated. As our reports on AMD’s Cayman and Nvidia’s Fermi show, the actual cores are very much akin to simple CPU cores with heavy use of vectors and multi-threading. The memory hierarchies are fairly different from those of CPUs and encompass an incredible degree of optimization to provide high bandwidth, at the cost of latency, which is hidden by the multi-threading. Performance is influenced by a huge number of factors. The microarchitecture of the shader cores (and the number of such cores) is a key component. The differences between AMD, Intel and Nvidia shader cores are quite profound – much greater than the distinction between most high performance CPU cores. External memory bandwidth and coalescing of memory accesses are extremely important for throughput. But many other factors come into play, such as the use of caches – whether read-only, write-only or general, coherent or incoherent – and the nature of the on-die interconnects (e.g. crossbar versus ring versus hierarchical). The remaining fixed function hardware for shader scheduling, rendering and texturing is also critical. Last but not least, drivers – which are essentially aggressive, just-in-time optimizing compilers – have a significant impact as well.
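To make the coalescing point concrete, consider the rough Python sketch below, which counts how many aligned memory segments a single group of 32 accesses touches. The 128-byte segment size and 32-wide access group are illustrative assumptions rather than the parameters of any particular GPU.

```python
def memory_transactions(addresses, segment_bytes=128):
    """Count the aligned memory segments needed to service one group of
    accesses, assuming the hardware fetches whole segments at a time."""
    segments = {addr // segment_bytes for addr in addresses}
    return len(segments)

# Coalesced: 32 consecutive 4-byte accesses fall into a single 128-byte segment.
coalesced = [i * 4 for i in range(32)]
# Strided: the same 32 accesses land in 32 different segments.
strided = [i * 128 for i in range(32)]

print(memory_transactions(coalesced), memory_transactions(strided))  # 1 32
```

In this toy model, the strided pattern consumes 32 times the external bandwidth to deliver the same useful data, which is why coalescing looms so large for throughput.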
One of the ways to measure how well GPUs are understood is the degree to which their performance can be predicted. One of the main reasons to analyze an architecture (whether a CPU, GPU or whole system) is to understand how it will execute different workloads. In turn, this grants substantial insight into the various trade-offs that designers make to achieve performance and power efficiency, and into future products.
The key to accurately predicting performance and efficiency is being able to identify and isolate the variables that most strongly determine performance and build them into a predictive model. Of course these models vary in complexity; from simple back-of-the-envelope calculations to moderately involved analytic formulas to the cycle accurate simulators that designers rely upon. Performance can be expressed in a fairly simple form with a few variables:

Performance = (Frequency × IPC) / Instruction Count
The more sophisticated models will use a variety of information about both the workload and underlying hardware to understand these variables. As a simple example, 16-wide vector instructions can substantially decrease the instruction count (by up to a factor of 16, but realistically much less) to execute a given program. More complex models might attempt to show how the IPC (Instructions Per Cycle) is impacted by factors such as control flow divergence, or explore how workload characteristics like anti-aliasing or anisotropic filtering alter performance. While not a substantial issue yet, a more advanced model would also account for the fact that frequency is dynamic and may be adjusted based on power draw (which is itself influenced by the IPC).
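As a rough sketch of what such a model looks like (and emphatically not the model built later in this article), the following Python shows how vector width, divergence and clock frequency might feed into the basic performance equation above. Every constant and adjustment factor in it is a made-up placeholder.

```python
def performance(instruction_count, ipc, frequency_hz):
    """Basic form: instructions retired per second divided by the
    instructions needed for the workload (i.e. workloads per second)."""
    return (ipc * frequency_hz) / instruction_count

def vectorized_instruction_count(scalar_ops, vector_width, vector_efficiency):
    """A 16-wide vector unit can cut the instruction count by up to 16x,
    but the realized reduction depends on how much of the code vectorizes."""
    return scalar_ops / (vector_width * vector_efficiency)

def divergence_adjusted_ipc(peak_ipc, divergent_fraction):
    """Crude assumption: diverged work executes both sides of a branch,
    halving effective throughput for that fraction of the program."""
    return peak_ipc * (1.0 - 0.5 * divergent_fraction)

# Hypothetical workload: 10 billion scalar operations, 16-wide vectors at
# 60% efficiency, a peak IPC of 2, 20% divergent work, a 1 GHz clock.
instructions = vectorized_instruction_count(10e9, 16, 0.6)
ipc = divergence_adjusted_ipc(2.0, 0.2)
print(performance(instructions, ipc, 1.0e9))
```

Within this sketch the leverage of each term is easy to see: halving the vector efficiency doubles the instruction count and therefore halves performance, exactly as much as halving the clock would.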
In this article, we will use a reasonably sized data set to build a model of AMD and Nvidia GPU performance for a graphics benchmark, check it against real results and analyze what it says about the different architectures.