Computational Efficiency in Modern Processors
The computer industry is on the cusp of yet another turn of the Wheel of Reincarnation, with the graphics processing unit (GPU) cast as the heir apparent to the floating point co-processors of days long gone. Modern GPUs are ostensibly higher performance and more power efficient than CPUs for their target workloads, and many companies and media outlets claim they are leaving CPUs in the dust. Is this really the case, though?
Today's GPUs are monstrously powerful compute devices for explicitly parallel and embarrassingly parallel workloads. They are multi-core processors with 10 (AMD) to 30 (Nvidia) cores per GPU, and each core executes extremely wide vectors. The Nvidia GT200 is capable of 933 or 622 GFLOP/s single precision (SP), depending on how you count, and 77 GFLOP/s double precision (DP). The competing AMD RV770 can execute up to 1.2 TFLOP/s SP and 240 GFLOP/s DP. In contrast, a high-end CPU such as Nehalem achieves roughly 102 GFLOP/s and 51 GFLOP/s for single and double precision respectively.
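These peak numbers follow directly from execution width and clock frequency. A minimal sketch of the arithmetic – the unit counts and clock rates below are our assumptions about the specific SKUs, not figures stated in this article:

```python
def peak_gflops(units, flops_per_unit_per_clock, ghz):
    """Theoretical peak = execution units x FLOPs per unit per clock x clock (GHz)."""
    return units * flops_per_unit_per_clock * ghz

# GT200 (assumed: 240 SP units at a ~1.296 GHz shader clock).
# Count 3 FLOPs/clock per unit (MAD plus the extra MUL) or 2 (MAD only) --
# hence "depending on how you count".
print(peak_gflops(240, 3, 1.296))  # ~933 GFLOP/s SP
print(peak_gflops(240, 2, 1.296))  # ~622 GFLOP/s SP
print(peak_gflops(30, 2, 1.296))   # ~78 GFLOP/s DP (assumed 30 DP units)

# Nehalem (assumed: 4 cores, 4-wide SSE add + 4-wide SSE multiply, 3.2 GHz).
print(peak_gflops(4, 8, 3.2))      # ~102 GFLOP/s SP
print(peak_gflops(4, 4, 3.2))      # ~51 GFLOP/s DP
```

The same formula reproduces the RV770 figure as well (800 SP lanes, 2 FLOPs/clock, 0.75 GHz yields 1.2 TFLOP/s), under the same caveat that these microarchitectural parameters are assumptions.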
Raw performance numbers are impressive, but not necessarily a good metric for comparison. Unconstrained performance is rarely of interest – servers have certain power, cooling and cost limits, as do notebooks and cell phones. Programmability is another key element, although harder to quantify. The Cell processor was the first in the most recent cohort of co-processors (and while it doesn't run DirectX or OpenGL, it is conceptually similar to a GPU). It offered amazing performance, but uptake was limited by a positively hostile programming environment that lacked almost all the niceties to which developers are accustomed.
Rather than considering performance alone, a more interesting comparison examines the performance/watt and performance/mm2 of silicon – thus taking into account the huge power consumption (over 300W in some cases) and die area of GPUs. We will make a rough comparison of modern processors, both CPUs and GPUs, on 65nm, 55nm and 45nm. For performance, we have generally chosen the theoretical peak DP FLOP/s for the highest or second highest frequency bin of a chip. Double precision (64-bit) is the standard for most computing; 32-bit operations are only useful for very specific workloads. We use the TDP for that frequency bin as the power number, representing power and thermal consumption, and the die area for that chip as the area number. Chips where one or more pieces of information were missing were omitted from our line-up (in particular, there are no entries from IBM's POWER line and relatively few for GPUs either).
This is not a true ‘apples to apples’ comparison, but a start in the right direction. Theoretical peak FLOP/s is not a great metric because it does not consider utilization in real software, nor does it reward investments in ease of use and programming. In an ideal world, we would measure performance and energy consumed on a given set of benchmarks. However, choosing a set of benchmarks is very complicated, especially given the myriad of different instruction sets and capabilities. There are standards for comparing CPUs, e.g. SPEC CPU2006, and there are some standards (albeit somewhat dodgy) for comparing GPUs – but nobody has established a consistent and fair way to compare CPUs and GPUs. Additionally, getting access to all this hardware to run a set of benchmarks would be very challenging. Rather than delving into that morass of complexity, we instead opted to focus on a simpler performance number that is tied to physical quantities (frequency and execution width) and hence readily available. Other complicating details include:
- Some GPUs and Cell do not fully support the IEEE 754 double precision standard.
- GPUs and CPUs typically require additional chips to make a complete system. For instance, GPUs need host processors, some CPUs need external memory controllers or caches. We do not estimate the area and power costs of these supporting chips.
- GPUs and CPUs use different process nodes which are not always directly comparable, and process technology heavily influences power and density.
- GPUs have a very restricted programming model; they do not run certain workloads, cannot boot an operating system and require a host processor.
- GPU power numbers may be system level and may include graphics DRAM or other components.
- Some CPUs and GPUs have exotic or expensive cooling systems, which substantially lower power consumption by reducing junction temperature and leakage, but add cost.
- Server CPUs have high capacity memory systems, high bandwidth coherent interconnects and large caches to enable scalable systems; these don’t contribute to FLOP/s and cost power and area but are essential for key workloads (ERP, OLTP, etc.).
- CPU and GPU vendors measure TDP or power differently in some cases.
- CPUs (especially for servers) have much more extensive reliability features such as error correction than current GPUs.
- Performance/watt and performance/mm2 vary with product SKUs and frequency bins.
With those caveats in mind, the chart showing our data is on the next page.
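With the peak FLOP/s, TDP and die area in hand, the two efficiency ratios are simple divisions. A sketch of the calculation – the TDP and die-area values below are approximate public figures we are assuming for illustration, not data taken from the chart:

```python
def efficiency(gflops, tdp_watts, die_mm2):
    """Return (GFLOP/s per watt, GFLOP/s per mm^2) for one chip."""
    return gflops / tdp_watts, gflops / die_mm2

# Illustrative inputs (assumed SKU-level figures), paired with the
# DP peak numbers quoted earlier in the article:
nehalem_pw, nehalem_pa = efficiency(51.2, 130, 263)  # ~130W TDP, ~263 mm^2
gt200_pw, gt200_pa = efficiency(77.8, 236, 576)      # ~236W TDP, ~576 mm^2
print(nehalem_pw, nehalem_pa)
print(gt200_pw, gt200_pa)
```

Even this crude sketch shows why the chart matters: on double precision the raw FLOP/s gap shrinks dramatically once power and area are in the denominator.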