Compute Efficiency
Chart 2 below shows the performance per watt and performance per mm2 of silicon for various CPUs and throughput devices. Conventional microprocessors are shown with diamonds, while squares denote throughput processors. The color indicates the process technology: blue for 22nm, green for 32nm, yellow for 40nm, and brown for 45nm.

Chart 2. Compute Efficiency in 2012 Processors
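Both axes of the chart can be reproduced from spec-sheet data. The sketch below shows the arithmetic; the inputs are placeholder values for a hypothetical chip, not figures for any specific product in the chart.

```python
# Sketch of how the two chart metrics are derived from spec-sheet data:
# peak GFLOPs = cores x DP FLOPs/cycle/core x frequency (GHz).
def efficiency_metrics(cores, flops_per_cycle, freq_ghz, tdp_w, die_mm2):
    peak_gflops = cores * flops_per_cycle * freq_ghz
    return {
        "peak_gflops": peak_gflops,
        "gflops_per_watt": peak_gflops / tdp_w,
        "gflops_per_mm2": peak_gflops / die_mm2,
    }

# Hypothetical example: an 8-core chip doing 8 DP FLOPs/cycle at 3.0GHz,
# with a 130W TDP and a 400mm2 die (placeholder values only).
print(efficiency_metrics(cores=8, flops_per_cycle=8, freq_ghz=3.0,
                         tdp_w=130, die_mm2=400))
```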
The throughput processors include two 40nm GPUs, Nvidia’s Fermi (Tesla M2090) and AMD’s Cypress (FireStream 9370). The former is a 16-core design based on the Fermi architecture, which is the standard-bearer for programmable GPUs. Fermi has a unified memory space, coherent caching, half-rate double precision, and ECC protection for SRAM arrays. Fermi can also enable ECC for external memory, although capacity and bandwidth are reduced by around 16% when it is active. The Cypress architecture is less compute-optimized; it retains a graphics-focused split address space, without coherent caching or error protection. Each of the 20 cores contains 16 VLIW5 units, and each unit can execute two double precision operations per cycle. AMD’s solution conserves substantial area by eschewing various programmability features and using the very dense VLIW5 core; it is roughly 30% denser than Fermi, although about 20% less power efficient.
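As a rough check on those deltas, the peak double precision rates follow from the core counts and clocks above; the die sizes used below are public estimates rather than vendor figures, so the ratios are approximate.

```python
# Peak DP throughput for the two 40nm GPUs (spec-sheet clocks; die sizes
# are public estimates, so treat the ratios as approximate).
m2090_gflops = 16 * 32 * 2 * 1.3 / 2   # 16 cores x 32 lanes x FMA, half-rate DP @1.3GHz
fs9370_gflops = 20 * 16 * 2 * 0.825    # 20 cores x 16 VLIW5 units x 2 DP ops @825MHz
print(m2090_gflops, fs9370_gflops)     # ~665.6 vs ~528 GFLOPs

m2090_density = m2090_gflops / 520     # GF110 die ~520mm2 (estimate)
fs9370_density = fs9370_gflops / 334   # Cypress die ~334mm2 (estimate)
print(fs9370_density / m2090_density)  # ~1.24x, in line with "roughly 30% denser"

# Both boards are rated around 225W, so Cypress's lower peak rate
# translates directly into ~20% lower GFLOPs/W.
print((fs9370_gflops / 225) / (m2090_gflops / 225))  # ~0.79
```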
IBM’s Blue Gene/Q and Oracle’s T4 are somewhat less focused. Both feature relatively large last level caches and, in the case of the T4, out-of-order execution. However, the architectures were clearly intended to emphasize throughput, rather than per-core performance. The Blue Gene/Q offers the best performance/watt, nearly 25% more efficient than the closest alternative, despite using a slightly less advanced process technology. The power efficiency is largely due to aggressive water cooling (which reduces static power) and operating at 1.6GHz and 0.8V (which lowers dynamic power). However, the density is half of Fermi’s. Of the 18 cores, only 16 are available for computation; one is reserved for the OS and one is a spare that handles manufacturing defects. Unlike a GPU though, Blue Gene/Q features a 32MB eDRAM L2 cache that occupies 31% of the chip, substantial networking logic and high speed I/Os.
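The Blue Gene/Q margin is easy to verify from the published figures; the ~55W chip power is an IBM estimate, so the exact percentage is approximate.

```python
# Blue Gene/Q: 16 compute cores x 4-wide DP FMA (8 FLOPs/cycle) x 1.6GHz.
bgq_gflops = 16 * 8 * 1.6              # ~204.8 GFLOPs peak
bgq_eff = bgq_gflops / 55              # ~55W chip power (published estimate)
m2090_eff = 665.6 / 225                # Fermi M2090 for comparison
print(bgq_eff, m2090_eff, bgq_eff / m2090_eff)  # ~3.7 vs ~3.0 GFLOPs/W: ~25% better
```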
The T4 is the least impressive, based on FLOPs/mm2 and FLOPs/W, because floating point throughput was not a principal design goal. The T4 is optimized for commercial server workloads, such as databases and Java applications; it includes a large number of high-speed memory and coherency links, while operating at a high frequency. Unlike the Blue Gene/Q, with its 4-wide SIMD unit, the T4 has no double precision vector instructions, and each of the 8 cores can only perform a single fused multiply-accumulate per cycle.
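That limitation makes the T4’s peak easy to estimate; the ~3.0GHz clock below is assumed from the product specs rather than stated in the chart.

```python
# T4: 8 cores, one DP fused multiply-accumulate (2 FLOPs) per cycle,
# assuming the ~3.0GHz product clock.
t4_gflops = 8 * 2 * 3.0
print(t4_gflops)   # ~48 GFLOPs peak, an order of magnitude below the GPUs
```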
The CPUs shown are a relatively homogeneous group from AMD, Fujitsu, IBM and Intel. They are all out-of-order designs with fairly high single threaded performance and generous caches. Unsurprisingly, the designs are clustered relatively close together, with a few exceptions.
Fujitsu’s SPARC64-VIIIfx is particularly outstanding for power efficiency, matching AMD’s Cypress. The reasons are fairly similar to the factors at play for Blue Gene/Q. As an earlier article on the Kei Supercomputer discussed, the water cooling substantially reduces leakage power. Additionally, the 8 CPU cores run at 2GHz (down from the 2.5GHz of the predecessor) to keep the voltage under 1.1V and reduce dynamic power. The area efficiency is relatively unremarkable, roughly in the middle of the pack for the CPUs.
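The claim of matching Cypress can be sanity-checked against the chip’s published figures of 128 GFLOPs peak at a rated 58W.

```python
# SPARC64-VIIIfx: 8 cores x 8 DP FLOPs/cycle x 2GHz (published: 128 GFLOPs, 58W).
viiifx_eff = (8 * 8 * 2.0) / 58
cypress_eff = 528 / 225            # FireStream 9370, from the GPU figures above
print(viiifx_eff, cypress_eff)     # ~2.2 vs ~2.3 GFLOPs/W: essentially matched
```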
The quad-core Ivy Bridge (E3-1280V2) also stands apart, demonstrating the benefit of Intel’s 22nm FinFET process. As expected, the density is outstanding: 60% greater than the closest CPU and 28% better than the throughput-optimized Blue Gene/Q. Since this is a derivative of a client design, there is no area allocated to coherent interconnects, but a large portion of the die is taken up by the integrated Ivy Bridge GPU. If the graphics were removed, the die size would decrease by around 38%. The power efficiency is modestly improved over 32nm products (~20%). However, this does not show the full potential of the 22nm process, because Ivy Bridge operates at a high voltage to achieve 3.6GHz, while the gains from FinFETs are most pronounced around 0.7V.
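To see what removing the graphics would mean for the density metric: shrinking the die by ~38% at constant peak FLOPs scales FLOPs/mm2 by 1/(1−0.38). The ~160mm2 figure below is the published quad-core Ivy Bridge die size; the GPU-less die is hypothetical.

```python
# Quad-core Ivy Bridge (E3-1280V2): 4 cores x 8 DP FLOPs/cycle x 3.6GHz.
ivb_gflops = 4 * 8 * 3.6                          # ~115 GFLOPs peak
ivb_density = ivb_gflops / 160                    # ~160mm2 die (published)
no_gpu_density = ivb_gflops / (160 * (1 - 0.38))  # hypothetical GPU-less die
print(ivb_density, no_gpu_density)  # ~0.72 -> ~1.16 GFLOPs/mm2, a ~1.6x gain
```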
The main cluster of CPU results includes a variety of 32nm and 45nm server products. The 8-core Sandy Bridge-EP (E5-2690) stands out as the most attractive. This is not surprising, since the Sandy Bridge microarchitecture combines 4-wide execution units for AVX with impressive frequencies. Intel’s Westmere-EP (X5672) and Westmere-EX (E7-8867L) lack that benefit, which clearly shows. The Westmere-EX has the worst overall efficiency of the CPUs, reflecting the costs associated with scalable servers, such as large caches and multiple coherency links.
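The AVX advantage shows up directly in the per-cycle arithmetic; the clocks below are the base frequencies for each model.

```python
# Sandy Bridge-EP: 4-wide AVX multiply + 4-wide add = 8 DP FLOPs/cycle per core.
e5_2690 = 8 * 8 * 2.9   # 8 cores @2.9GHz base: ~185.6 GFLOPs
# Westmere: 2-wide SSE multiply + 2-wide add = 4 DP FLOPs/cycle per core.
x5672 = 4 * 4 * 3.2     # 4 cores @3.2GHz: ~51.2 GFLOPs
print(e5_2690, x5672)
```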
AMD’s Interlagos (Opteron 6278) takes a very different approach, due to the unique architecture of Bulldozer. The 16 cores share 8 floating point clusters, which are capable of two double precision multiply-accumulates per cycle. This is half the per-core throughput of Sandy Bridge-EP, although with twice as many cores AMD achieves comparable performance/watt. The downside is that the high core count translates into lower frequency and huge caches (32MB of total L2 and L3), which hurt performance/mm2. AMD’s 12-core Magny-Cours (Opteron 6176) has similar challenges with area efficiency, but far less cache (18MB total).
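Working through the numbers, assuming each multiply-accumulate above refers to a 128-bit FMA operation (two 64-bit MACs, i.e. 8 FLOPs per cluster per cycle, matching the widely cited peak), with the Opteron 6278’s 2.4GHz base clock and 115W TDP:

```python
# Interlagos (Opteron 6278): 8 FP clusters x 8 DP FLOPs/cycle x 2.4GHz base.
interlagos = 8 * 8 * 2.4                 # ~153.6 GFLOPs peak
e5_2690 = 8 * 8 * 2.9                    # Sandy Bridge-EP from above: ~185.6 GFLOPs
print(interlagos / 115, e5_2690 / 135)   # ~1.34 vs ~1.37 GFLOPs/W: comparable
```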
Last is the impressive 45nm POWER7, an 8-core monster that dissipates 250W at 4GHz. More so than any other CPU, the POWER7 spends tremendous resources on coherency, memory and I/O. Despite this focus on large systems, the performance is incredibly good. The main factor is IBM’s 32MB eDRAM L3 cache, which saves significant area and power compared to the SRAM implementations in other CPUs. The power efficiency is very good, due to aggressive physical design. The POWER7 has a whopping 23 frequency domains and 63 voltage domains across the chip, with 3 voltage domains per core (core, L2 cache, and local L3 eDRAM). In contrast, commodity CPUs have 4-7 voltage rails.
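For completeness, the POWER7’s peak follows the same pattern; the 8 DP FLOPs/cycle per core (four fused multiply-adds via the FP pipelines) is a widely published figure.

```python
# POWER7: 8 cores x 8 DP FLOPs/cycle (4 FMAs per core) x 4GHz, 250W.
p7_gflops = 8 * 8 * 4.0            # ~256 GFLOPs peak
print(p7_gflops, p7_gflops / 250)  # ~1.0 GFLOPs/W despite the 45nm process
```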