Chart 1 below shows the performance per watt and performance/mm2 of silicon for various CPUs and GPUs. To help make more sense of all this data, and highlight key differences, GPUs are marked with squares and CPUs with diamonds. 65nm products are shown in orange, 55nm in green and 45nm in blue.
Chart 1 – Performance Efficiency in Modern Processors
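As a rough sketch of how the two axes in Chart 1 could be derived, the snippet below divides a peak throughput figure by power and by die area. It assumes performance is expressed as peak GFLOP/s, as the per-core figures later in this section suggest. The function name and the sample inputs are made-up placeholders for illustration, not measurements of any real processor.

```python
def efficiency(peak_gflops: float, power_watts: float, die_area_mm2: float) -> tuple[float, float]:
    """Return (performance per watt, performance per mm2) for one processor.

    Assumes performance is a peak GFLOP/s figure; power and area must be positive.
    """
    if power_watts <= 0 or die_area_mm2 <= 0:
        raise ValueError("power and die area must be positive")
    return peak_gflops / power_watts, peak_gflops / die_area_mm2


# Hypothetical placeholder inputs, NOT data for any real chip.
perf_per_watt, perf_per_mm2 = efficiency(peak_gflops=100.0, power_watts=50.0, die_area_mm2=200.0)
print(f"{perf_per_watt:.2f} GFLOP/s per watt, {perf_per_mm2:.2f} GFLOP/s per mm2")
```

Plotting one of these values on each axis, one point per processor, would give a chart of the kind described above.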
The most remarkable aspect of Chart 1 is that nearly all processors are clustered in the same rough area, with four striking exceptions. The RV670, RV770, PowerXCell 8i and Silverthorne all stand out as vastly more efficient than other computing devices. These outliers and the scale of the chart make it a little hard to drill down and see where everything else stands, so Chart 2 below removes these four outliers to zoom in on the majority of the processors.
Chart 2 – Performance Efficiency in Select Modern Processors
These two charts provide some rather interesting insights into the performance efficiency of GPUs and Cell, which tend to be on older 65nm and 55nm processes. AMD’s (formerly ATI’s) 55nm RV770 is clearly the top of the heap, with about 3x better performance/watt and 4x better performance/mm2 than the best high performance CPUs and Nvidia’s GT200b. One advantage for the RV770 is faster GDDR5 memory, which substantially reduces the area used for external interfaces. The architecture of the entire RV770 family (including the RV670) emphasizes density. Front-end logic for control flow, instruction fetch and decode, and the shared memory is used more sparingly than in other GPUs; there are only 10 cores (called SIMDs), each providing 24 GFLOP/s. Rather than spend hardware resources, almost all scheduling is exposed and pushed to the compiler and programmer; within each SIMD is an array of VLIW units, which require a certain degree of instruction level parallelism (ILP). The one downside of this approach is that utilization on real workloads tends to suffer, which is not apparent in our analysis but is worth highlighting. The 55nm RV670 is less impressive, but still very good. Its cores are less dense and it uses more die area for GDDR3 memory interfaces.
The PowerXCell 8i looks quite good, given that it was built on an older 65nm process. This is largely because the SPEs don’t spend power or area on basic programmability aids like instruction caches, scheduling or other niceties; these programming challenges also explain why there are no 45nm versions except for the PS3 Slim. Achieving good utilization of the SPEs is even more difficult than for the RV770 family, although PS3 developers have had quite a while to work on this problem.
Surprisingly, Nvidia’s 55nm GT200b is pretty much identical in performance/watt and performance/mm2 to Intel’s Nehalem, which is likely the best 45nm CPU. This is hardly a bad showing, as Nehalem is a finely tuned design that puts almost all other CPUs to shame, but the GT200b is far less efficient on paper than the RV770 at the same process node. The 65nm GT200 is not as competitive and trails two 65nm CPUs (Barcelona has higher performance/watt and Merom has higher performance/mm2).
This gap between the two major GPU families can be explained in several ways. Both Nvidia GPUs use GDDR3, which requires substantially more area and power for a given amount of bandwidth. Nvidia’s architecture has more cores (30 SMs, as they are known), each with its own front-end and shared memory. Moreover, each core includes considerable scoreboarding logic for dynamic scheduling. While this control logic uses power and area, Nvidia’s cores do not require much ILP within a given data stream, whereas that is necessary to utilize AMD’s VLIW approach. Each core has dedicated double precision hardware capable of 3 GFLOP/s, which costs more area and power than sharing single precision execution resources. Another factor may be that Nvidia’s cores have longer pipelines to achieve higher frequency, and thus more latches to hold intermediate results.
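To put the per-core figures in perspective, the sketch below multiplies the core counts and per-core throughput quoted in the text (10 SIMDs at 24 GFLOP/s for RV770, 30 SMs at 3 GFLOP/s for the Nvidia parts) to get an aggregate peak for each design. The `Gpu` class is a hypothetical helper for illustration, and treating the two per-core numbers as directly comparable is an assumption. The result is raw peak throughput only; the charted metrics additionally divide by power and die area, which are not given in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Hypothetical helper holding the per-core figures quoted in the text."""
    name: str
    cores: int
    gflops_per_core: float

    @property
    def peak_gflops(self) -> float:
        # Aggregate peak is simply cores times per-core throughput.
        return self.cores * self.gflops_per_core


# Core counts and per-core GFLOP/s as stated in the article.
rv770 = Gpu("RV770", cores=10, gflops_per_core=24.0)
gt200 = Gpu("GT200/GT200b", cores=30, gflops_per_core=3.0)

ratio = rv770.peak_gflops / gt200.peak_gflops
for gpu in (rv770, gt200):
    print(f"{gpu.name}: {gpu.peak_gflops:.0f} GFLOP/s peak")
print(f"RV770 peak is {ratio:.2f}x the Nvidia figure")
```

With fewer but much wider cores, the AMD design reaches a higher aggregate peak from less front-end logic, which matches the density argument made above.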
The last processor that deserves a lot of attention is Intel’s 45nm Silverthorne, now known as Atom. While it isn’t very area efficient, it has exceptional performance/watt, only a little behind RV770 and exceeding all other devices. The extent of the power efficiency was relatively unexpected. Silverthorne is an in-order core, leaving out some of the most expensive control logic found in Nehalem or Shanghai, so it should be more power efficient than out-of-order CPUs. But it still has many expensive structures common to CPUs (branch prediction, x86 decoders and microcode, x87 support, etc.), and it executes only two scalar double precision operations at a time. So it was somewhat surprising that Silverthorne’s performance/watt was 3x better than that of Nehalem and GT200b.
The rest of the data points are interesting, but less relevant to the question at hand – the efficiency of GPUs and CPUs.