In an earlier article, we evaluated the compute efficiency for a variety of processors available in 2009, including both leading CPUs and throughput devices (e.g. GPUs and the Cell processor). The majority of these designs were manufactured on 65nm CMOS, although some GPUs were using TSMC’s 55nm process and select Intel and AMD processors were available on 45nm.
At the time, GPUs were emerging as potential compute devices, with vendors such as Nvidia claiming orders of magnitude better performance than conventional solutions. In part, this advantage was due to rather different power budgets. GPUs plateaued at roughly 250-300W, whereas few commodity CPUs exceeded 130W. Similarly, high-end GPUs tended to max out the available die area, while CPUs were constrained to less area. To make a reasonable comparison between different CPUs and GPUs, our analysis focused on efficiency and normalized performance by the area and power consumption. Chart 1 below shows the data from 2009.
Our analysis showed that modern GPUs were generally more efficient than CPUs, as measured by FLOP/s per watt and FLOP/s per mm². However, the difference was very microarchitecture-specific. Some GPUs were less efficient than CPUs on the same process technology (e.g. Nvidia’s GT200 compared to Intel’s Nehalem), while other GPUs were far more efficient (e.g. AMD’s RV770) and clearly stood out.
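To make the normalization concrete, both efficiency metrics follow directly from a chip's peak throughput, TDP, and die area. The sketch below uses round, illustrative numbers in the spirit of the 2009 data (a ~50 GFLOP/s, 130W CPU versus a ~240 GFLOP/s, 160W GPU); they are placeholders, not the actual chart values.

```python
# Normalize peak throughput by power and die area so that processors
# with very different power and area budgets can be compared.
def efficiency(gflops, tdp_watts, die_mm2):
    """Return (GFLOP/s per watt, GFLOP/s per mm^2)."""
    return gflops / tdp_watts, gflops / die_mm2

# Illustrative CPU: ~50 GFLOP/s peak, 130W TDP, 260 mm^2 die.
cpu_w, cpu_mm2 = efficiency(50.0, 130.0, 260.0)    # ~0.38 GFLOP/s/W, ~0.19 GFLOP/s/mm^2
# Illustrative GPU: ~240 GFLOP/s peak, 160W TDP, 256 mm^2 die.
gpu_w, gpu_mm2 = efficiency(240.0, 160.0, 256.0)   # 1.5 GFLOP/s/W, ~0.94 GFLOP/s/mm^2
```

With these placeholder figures the GPU comes out roughly 4-5× more efficient on both axes, which is the kind of gap the normalization is designed to expose.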
Since our first look at compute efficiency, the industry has changed dramatically and it is appropriate to re-evaluate our analysis. At the time, the programmability of GPUs was quite limited. For instance, some GPUs did not have double precision floating point support, and those that did suffered large performance penalties. The APIs for targeting GPUs were proprietary, immature or non-existent and the memory model for GPUs was particularly poor, without unified address spaces or coherent caching. Reliability was also questionable; GPUs lacked error protection for on-chip SRAM as well as DRAM, meaning that they could not be used for many workloads.
The comparison is more straightforward in 2012 than it was in 2009. GPUs are a much more accepted part of the computing landscape and are at a mid-point in terms of maturity. Almost all GPUs have very reasonable double precision performance and new features that enhance programmability through APIs such as OpenCL. Some, but not all, GPUs offer coherency and unified addressing along with ECC-protected memory and caches. Nvidia has generally been far more aggressive about such features than AMD, although the gap will close in the next generation. Perhaps more importantly, there are specific GPU models that are targeted at compute workloads (rather than graphics). At the same time, CPUs have evolved to take advantage of new vector extensions, such as 256-bit AVX, which improve overall compute efficiency.
There are still a number of factors that make comparisons imperfect:
- GPUs and CPUs typically require additional chips to make a complete system. For instance, GPUs need relatively high performance host processors and CPUs need I/O hubs.
- GPU programming models are still restrictive compared to CPUs.
- GPU power often includes board-level components such as memory and VRMs; CPUs do not.
- Cooling systems vary tremendously between different processors; more aggressive cooling can reduce power consumption but raises system cost.
- Server processors tend to invest considerable power and area into coherency, memory capacity and large caches to enable scalability, whereas client CPUs and GPUs do not.
- Most importantly, theoretical compute power does not translate directly into actual workload performance. CPUs typically achieve much higher utilization than GPUs.
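The last point can be made concrete: delivered performance is peak throughput scaled by utilization, so a CPU running near its peak can rival a GPU with several times the theoretical throughput. The utilization figures below are hypothetical, chosen only to illustrate the effect.

```python
# Delivered (sustained) throughput is theoretical peak scaled by the
# fraction of that peak a real workload actually achieves.
def sustained_gflops(peak_gflops, utilization):
    """Delivered throughput given a utilization fraction in [0, 1]."""
    return peak_gflops * utilization

# Hypothetical CPU: 160 GFLOP/s peak at 75% utilization.
print(sustained_gflops(160.0, 0.75))  # 120.0
# Hypothetical GPU: 500 GFLOP/s peak at 25% utilization.
print(sustained_gflops(500.0, 0.25))  # 125.0
```

Despite a 3× gap in peak throughput, the two hypothetical chips deliver nearly identical sustained performance, which is why peak FLOP/s comparisons must be read with care.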
Despite these complications, the comparison is useful to see how the computing ecosystem has evolved over the last 3 years. The previous analysis suggested that the gap between CPUs and throughput processors would narrow. At the same time, new processors such as Intel’s Larrabee would introduce more variation in throughput devices.
The underlying manufacturing has changed for the industry as a whole. In 2012, a cutting edge CPU from Intel might use the 22nm FinFET process. Most CPU designs, though, are using 32nm or 45nm process technology, some taking advantage of high-k/metal gate transistors and some using conventional transistor stacks. Throughput processors (such as GPUs from AMD and Nvidia) that are suitable for serious computational workloads are using TSMC’s 40nm process. Chip designers have leveraged Moore’s Law to improve absolute performance. In 2009, a high-end CPU offered around 50 GFLOP/s and the best GPU could achieve 250 GFLOP/s. In three years, the theoretical performance has increased by roughly 2× for GPUs and 3× for CPUs, hinting that the gap might be closing.
To level the playing field, our analysis focuses on processors that are suitable for computational workloads. In practice, this means server processors and compute-specific GPU models. As with our previous work, compute performance is measured as the theoretical peak double precision FLOP/s for a single socket. The FLOP/s are calculated using the base frequency, and ignore any potential frequency uplift from aggressive power management. This approach is justified because a workload that achieves peak FLOP/s is likely to leave little thermal headroom. The CPUs shown are the highest (or very close to the highest) performance bin.
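The peak figure follows directly from core count, base clock, and double precision FLOPs per cycle. As a sketch, a Sandy Bridge-class server CPU with 256-bit AVX can issue one 4-wide DP add and one 4-wide DP multiply per core per cycle, for 8 DP FLOPs per cycle; the 8-core, 2.9 GHz configuration below is an illustrative high bin, not a specific product claim.

```python
# Theoretical peak double precision throughput at the base frequency,
# ignoring any turbo/boost uplift, as described in the text.
def peak_dp_gflops(cores, base_ghz, dp_flops_per_cycle):
    """Peak DP GFLOP/s = cores x base clock (GHz) x DP FLOPs/cycle."""
    return cores * base_ghz * dp_flops_per_cycle

# 8 cores x 2.9 GHz x 8 DP FLOPs/cycle (256-bit AVX: 4-wide add + 4-wide mul).
print(round(peak_dp_gflops(8, 2.9, 8), 1))  # 185.6
```

Running the same formula at a turbo frequency would inflate the result, which is exactly why the base clock is used: a workload sustaining peak FLOP/s leaves little thermal headroom for boosting.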