Looking at the data, the distinction between throughput processors (Blue Gene/Q, Fermi and Cypress) and traditional CPUs is consistent, but not as large as in our previous analysis. This is in line with our earlier hypothesis that CPUs and throughput processors will converge, narrowing the gap over time.
The best throughput processor (Fermi) has a 68% area and 77% power advantage compared to the best CPU (Ivy Bridge), despite using an older process technology. Excluding the GPU die area from Ivy Bridge, the density advantage falls to a little under 10%. While the two design philosophies may be converging, they are also more clearly delineated, as throughput processors have become more successful at handling double precision floating point. Unlike our previous analysis, there are no cases of GPUs that are less efficient than CPUs.
Interestingly, IBM’s Blue Gene/Q conclusively demonstrates that a CPU designed for throughput can match and even exceed the power efficiency of GPUs. There is still a gap in terms of area efficiency, but smaller than the data suggests given that Blue Gene/Q includes a large cache and robust interconnects that are not found in a GPU. This bodes relatively well for Intel’s Knights Corner, although the area problems might be worse with the overhead for x86 relative to PowerPC.
Another observation is that vector (and to a lesser extent multiply-accumulate) instructions have a huge impact on efficiency. On the same process, Sandy Bridge-EP has 1.44× and 1.7× better density and power efficiency than Westmere-EP. While there are many differences between the two, the 256-bit vector units are one of the largest contributors. Blue Gene/Q and Fujitsu’s SPARC64-VIIIfx also rely on vectors to boost raw computational performance.
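To make the vector-width effect concrete, here is a minimal sketch of peak double precision FLOPs per cycle per core as a function of vector width. The helper name and the pipe counts are assumptions for illustration (commonly cited configurations, not figures taken from this analysis), and peak rates say nothing about sustained performance.

```python
def peak_dp_flops_per_cycle(vector_bits: int, pipes: int, fma: bool = False) -> int:
    """Peak double precision FLOPs per cycle for one core.

    vector_bits: width of the vector registers in bits
    pipes:       number of vector pipes that can issue each cycle (assumed)
    fma:         True if each pipe performs a fused multiply-add (2 FLOPs per lane)
    """
    lanes = vector_bits // 64          # 64-bit double precision elements per vector
    flops_per_lane = 2 if fma else 1   # a multiply-accumulate counts as two FLOPs
    return lanes * pipes * flops_per_lane


# Assumed configurations, for illustration only:
westmere = peak_dp_flops_per_cycle(128, pipes=2)                # separate add and multiply pipes
sandy_bridge = peak_dp_flops_per_cycle(256, pipes=2)            # same pipes, twice the width
blue_gene_q = peak_dp_flops_per_cycle(256, pipes=1, fma=True)   # one 4-wide multiply-accumulate pipe

print(westmere, sandy_bridge, blue_gene_q)
```

Under these assumptions, doubling the vector width doubles the peak rate per core, while the measured density and power gains (1.44× and 1.7×) are smaller because the rest of the chip does not shrink with it.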
The data also shows the very real overhead of building bigger systems. Large caches, high speed interconnects and additional memory controllers all consume space and power. Westmere-EX demonstrates these costs. Westmere-EP has roughly twice the compute density of its larger sibling. Magny-Cours employs a comparable number of cores, with slightly better power and similar area, but on an older process node. Altogether, the data suggests that the compute density penalty for high-end server designs is equivalent to about one process node.
The comparison between Interlagos and Magny-Cours is also fairly instructive. This transition combines a substantial architectural change and a shrink from a conventional 45nm process to a 32nm process featuring high-k/metal gate transistors. From a theoretical standpoint, the unusual philosophy behind Bulldozer seems to be a success. The performance density jumped by about 50%, while the total cache capacity increased by a similar factor. However, performance/watt only grew by 39%. For comparison, Intel claimed that the 45nm HKMG process provided a 30% decrease in active power. If the move to 32nm HKMG had delivered a comparable benefit, the architecture itself would account for only about 7% of the gain. Given the numerous improvements in Bulldozer, it is hard to imagine that the new architecture improved compute/watt by a mere 7%. This implies that Global Foundries’ 32nm process is leaving performance on the table, which is consistent with reports on the difficulties associated with the gate-first approach to transistor formation.
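The 7% figure can be reproduced with a short calculation. This is a sketch of one plausible reading, assuming the 30% process claim is treated as a 1.30× performance/watt factor and that process and architecture gains multiply; the function and variable names are hypothetical. A stricter reading of "30% less power at equal performance" is shown alongside for comparison.

```python
def residual_gain(total_gain: float, process_gain: float) -> float:
    """Gain left for the architecture after dividing out the process contribution.

    Assumes the two contributions combine multiplicatively.
    """
    return total_gain / process_gain


total = 1.39  # Interlagos vs. Magny-Cours performance/watt, from the text

# Reading 1 (assumption): treat the 30% claim as a 1.30x performance/watt factor.
process_simple = 1.30
arch_simple = residual_gain(total, process_simple)

# Reading 2 (assumption): 30% less power at equal performance is 1 / 0.70 performance/watt.
process_strict = 1 / (1 - 0.30)
arch_strict = residual_gain(total, process_strict)

print(f"architecture gain, reading 1: {arch_simple:.3f}x")
print(f"architecture gain, reading 2: {arch_strict:.3f}x")
```

Under the first reading the architecture is left with roughly 7%; under the second it would be left with slightly less than nothing. Either way, the conclusion is the same: the process appears to have delivered less than a 30% power benefit.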
Looking to the Future
In late 2012 and early 2013, there should be a number of new products that change the overall picture. While the first GPUs from AMD and Nvidia using TSMC’s 28nm process have already been released, the compute variants are still under development and expected at the end of 2012. Moving to a new process technology should yield an efficiency improvement of at least 50% across the board. Intel’s Knights Corner will also arrive, providing the first opportunity to evaluate an x86-based throughput processor. Realistically, these new products should widen the gap between CPUs and throughput processors.
The other major development expected in the near future is the emergence of microservers based on extremely low-power x86 and ARM processors. Calxeda has already started shipping a server based on a quad-core A9; however, the target market is really scale-out workloads with little floating point. Eventually, Calxeda will upgrade to the higher performance ARM A15, which can supposedly execute 4-8 FLOPs per cycle. Applied Micro is designing a custom ARM core that targets the server market, although it may not ship until late in 2013. The traditional x86 vendors are also expected to release servers based around low-power designs. Intel’s Centerton is slated for 2012 and AMD’s acquisition of SeaMicro will eventually yield optimized SoC designs. In theory, these offerings may shift the compute efficiency spectrum by sacrificing single-threaded performance to improve throughput. However, the results are far from clear at the moment, and the next year should be quite interesting to watch while we wait to revisit this topic.