What Do Overclockers and Supercomputers Have in Common?

Heat, Performance and Power

It turns out that the overclocker’s conventional wisdom regarding cooling is firmly rooted in reality. The operating temperature can have a significant impact on the behavior of microprocessors and other semiconductors, in terms of performance, power consumption and reliability.

The truth is that cooler chips run faster. Both transistors and metal interconnects slow down at higher temperatures. As transistors heat up, carrier mobility falls and the effective channel resistance goes up, which lowers the current when the transistor is on (the drive current). The resistance of metal wiring (almost exclusively copper) also rises with temperature, as the copper atoms vibrate more energetically and scatter electrons more often. Studies have shown that basic circuits (e.g. NAND, NOR, adder) slow down by 5-11% as the temperature rises from 0C to 125C, and the impact is even larger (up to 14%) for circuits using dynamic logic [1]. Realistically, a more effective cooling solution will not reduce the temperature of a chip to 0C, but it’s quite possible to keep things under 40C, and this could easily yield 4-8% faster circuits.
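The 4-8% estimate can be sanity-checked with a rough interpolation of the cited slowdown figures. This is only a sketch: it assumes delay grows linearly with temperature and that the uncooled chip sits at 125C, neither of which is stated in the source, and the function name is made up for illustration.

```python
# Rough check of the "4-8% faster" estimate from the 0C-to-125C slowdown figures.
# Assumptions (not from the source): delay scales linearly with temperature,
# and the hot baseline is 125C.

def delay_recovered(slowdown_0_to_125, t_hot_c, t_cool_c):
    """Fraction of circuit delay recovered by cooling from t_hot_c to t_cool_c."""
    per_degree = slowdown_0_to_125 / 125.0  # slowdown per degree C
    return per_degree * (t_hot_c - t_cool_c)

# 5% and 11% bracket static logic; 14% is the dynamic-logic worst case.
for slowdown in (0.05, 0.11, 0.14):
    gain = delay_recovered(slowdown, t_hot_c=125.0, t_cool_c=40.0)
    print(f"{slowdown:.0%} slowdown over 0-125C: about {gain:.1%} recovered at 40C")
```

Under these assumptions the static-logic figures land close to the 4-8% range quoted above. A cooler starting point would shrink the gain proportionally.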

The impact of temperature on frequency is fairly modest, but the influence on leakage power is tremendous. As described in our earlier IEDM 2005 article, there are three types of leakage in modern transistors: junction, gate and subthreshold (also called short channel). Subthreshold is the dominant form of leakage in modern designs; high-k/metal gates largely eliminated gate leakage, and junction leakage is a smaller factor to begin with. Dynamic power and gate leakage are largely unaffected by heat. However, subthreshold and junction leakage depend heavily on the temperature of the silicon (i.e. junction temperature).

The formula for subthreshold leakage in a single transistor is fairly complicated and involves many physical constants related to the process technology (A and n are process parameters, q is the charge of an electron and k is Boltzmann’s constant) and the size of the transistor (Width, Length) [2]. To simplify the situation, the second line replaces all the constants with alpha and beta to focus on the influence of temperature (T). The other key variable is Vt, the threshold voltage, where the transistor effectively switches from the ‘off’ to the ‘on’ state.
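For reference, a common textbook form of the subthreshold current that uses the symbols named above is sketched here. It assumes the transistor is off (gate-source voltage of zero) and that the drain-source voltage is much larger than kT/q. The exact grouping of constants into alpha and beta is an assumption and may differ from the original figure.

```latex
\begin{align}
I_{sub} &= A \cdot \frac{W}{L} \cdot \left(\frac{kT}{q}\right)^{2} \cdot e^{-\frac{q V_t}{n k T}} \\
        &= \alpha \cdot T^{2} \cdot e^{-\frac{\beta V_t}{T}},
\qquad \alpha = A \cdot \frac{W}{L} \cdot \left(\frac{k}{q}\right)^{2},
\quad \beta = \frac{q}{n k}
\end{align}
```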

As the equation makes clear, subthreshold leakage increases exponentially with temperature. The impact of Vt is also quite substantial, which is why most modern designs heavily emphasize using high Vt transistors except where performance is critical. Junction leakage also increases exponentially with temperature, but it is a far smaller effect overall and a lesser concern. Since better cooling will reduce the junction temperature, theory clearly predicts it will also reduce subthreshold leakage.
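To get a feel for the exponential, the following sketch evaluates the temperature-dependent part of a standard subthreshold model, T^2 * exp(-q*Vt / (n*k*T)), at two junction temperatures. The values of Vt and n are illustrative assumptions rather than figures from any real process, and Vt is held constant even though it drifts with temperature in real devices.

```python
import math

K_BOLTZMANN = 1.380649e-23    # Boltzmann's constant, J/K
Q_ELECTRON = 1.602176634e-19  # charge of an electron, C

def relative_leakage(temp_c, vt_volts, n=1.5):
    """Subthreshold leakage in arbitrary units: T^2 * exp(-q*Vt / (n*k*T)).

    The process/geometry prefactor is dropped because only ratios are compared.
    n=1.5 is an assumed, illustrative process parameter.
    """
    temp_k = temp_c + 273.15
    return temp_k ** 2 * math.exp(-Q_ELECTRON * vt_volts / (n * K_BOLTZMANN * temp_k))

def leakage_ratio(hot_c, cool_c, vt_volts, n=1.5):
    """How many times more a transistor leaks at hot_c than at cool_c."""
    return relative_leakage(hot_c, vt_volts, n) / relative_leakage(cool_c, vt_volts, n)

# Hypothetical low, nominal and high threshold voltages.
for vt in (0.25, 0.35, 0.45):
    ratio = leakage_ratio(hot_c=85.0, cool_c=30.0, vt_volts=vt)
    print(f"Vt = {vt:.2f} V: leakage at 85C is {ratio:.1f}x the leakage at 30C")
```

The same function also shows the Vt effect: at a fixed temperature, raising `vt_volts` cuts the leakage exponentially, which is the motivation for high Vt transistors off the critical path.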

The theory is important and helps to understand the general principles involved. However, practical examples are much more helpful in actually highlighting the real impact of cooling and leakage. It is relatively rare that a microprocessor vendor will discuss specific leakage components and give hard numbers; the data is often considered sensitive. Moreover, there are also very few microprocessor vendors that extensively use exotic cooling. Historically, only IBM’s mainframes and high-end POWER line could afford solutions beyond air cooling.

However, Fujitsu published a paper last year describing the SPARC64 VIIIfx microprocessor and specific techniques that were used to reduce power. The low power is critical since the processor is used in the massive Kei (K) supercomputer at RIKEN, currently the fastest on the TOP500 with 8.16 PFLOP/s on Linpack. The processor is a 2GHz, 8-core design with 128GFLOP/s, while dissipating only 58W [3]. The cores use 48W, while the L2 cache, DDR3 interfaces and coherency interconnects use 10W. The SPARC64 VIIIfx uses water cooling to keep the temperature under 30C. The authors estimated that lowering the junction temperature from 85C to 30C saves 7W for a typical chip, which is more than the power consumption of an entire core at 2GHz. Put another way, using conventional air cooling would increase chip power consumption by 12%.
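The arithmetic behind those comparisons is easy to reproduce from the figures quoted above. The only assumption in this sketch is that the 58W figure is the water-cooled (30C) number, so an air-cooled chip would draw 58W plus the 7W savings.

```python
# Figures quoted from the Fujitsu paper [3].
CHIP_POWER_W = 58.0       # total chip power (assumed to be the 30C figure)
CORE_POWER_W = 48.0       # power of all eight cores combined
NUM_CORES = 8
PEAK_GFLOPS = 128.0
COOLING_SAVINGS_W = 7.0   # saved by running at 30C instead of 85C

per_core_w = CORE_POWER_W / NUM_CORES                # average power of one core
air_cooled_w = CHIP_POWER_W + COOLING_SAVINGS_W      # estimated power at 85C
power_increase = COOLING_SAVINGS_W / CHIP_POWER_W    # penalty for air cooling
gflops_per_watt = PEAK_GFLOPS / CHIP_POWER_W         # peak efficiency at 30C

print(f"Power per core:        {per_core_w:.1f} W")
print(f"Air-cooled estimate:   {air_cooled_w:.0f} W")
print(f"Air-cooling penalty:   {power_increase:.1%}")
print(f"Peak efficiency:       {gflops_per_watt:.2f} GFLOP/s per W")
```

The 7W savings exceeds the 6W average for a single core, and 7W out of 58W is the 12% increase cited.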

This was the single most effective technique for reducing power, outweighing any microarchitectural feature. To put that 7W in context, a loop predictor saves 0.89W for the entire chip when executing a DGEMM loop (the worst case workload). Eliminating reads from the FP register file and instead using bypassing to feed the power hungry FPU only yields 1.4W. Controlling leakage currents is critical for any microprocessor or SoC.

Implications

The bottom line is that effective cooling can have a tremendous impact on the performance and power of a modern microprocessor. Based on academic studies and real world examples, water cooling can easily improve performance by 5% and performance/watt by 15-20%. The benefits are likely to be even larger for high power GPUs and server processors. Enthusiasts and overclockers have plenty of justification for an obsession with cooling; lowering the CPU temperature with liquid nitrogen or a Peltier to 5C or below could yield even larger gains than the ones documented by Fujitsu. Reliability is an additional benefit of lower temperatures and one of the reasons that IBM mainframes have been water cooled for decades.
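The 15-20% performance/watt figure can be cross-checked by combining the roughly 5% frequency gain with the Fujitsu power numbers (58W water cooled versus an estimated 65W air cooled). This sketch optimistically assumes the extra 5% frequency comes at no additional power, which the source does not claim.

```python
def perf_per_watt_gain(speedup, power_cool_w, power_hot_w):
    """Relative performance/watt improvement of the cooled chip over the hot one."""
    hot_efficiency = 1.0 / power_hot_w               # baseline performance = 1.0
    cool_efficiency = (1.0 + speedup) / power_cool_w
    return cool_efficiency / hot_efficiency - 1.0

# Power saving only, then power saving plus a 5% frequency gain.
power_only = perf_per_watt_gain(0.00, power_cool_w=58.0, power_hot_w=65.0)
with_speedup = perf_per_watt_gain(0.05, power_cool_w=58.0, power_hot_w=65.0)

print(f"Power saving alone:       {power_only:.1%} better performance/watt")
print(f"With a 5% frequency gain: {with_speedup:.1%} better performance/watt")
```

With both effects included, the result falls inside the 15-20% range quoted above.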

Of course, techniques like liquid cooling are not free; they consume additional power for pumps, filters and other components. This overhead increases the total system power consumption, but that is the trade-off: lower power chips, but higher power systems and the added complexity and expense of the cooling. In some cases, this is a very reasonable trade-off – if the extra power budget enables further integration and decreases the number of chips in a system it could reduce overall cost.

The importance of cooling is one of the reasons that Nvidia and AMD tightly control their high-end GPUs. Products based on Nvidia’s Fermi and AMD’s Cayman can easily dissipate 250W for a single card. Many of the add-in board vendors lack the expertise to design sufficient cooling and have a tendency to opt for the cheapest components available. This is a big risk for AMD and Nvidia, since a poor cooling solution could result in a more power hungry and less reliable product and potentially expose thermal throttling problems.

Cooling is also critically important to the future of Moore’s Law. 3D packaging is widely acknowledged as the next step in integration, since it can be used for heterogeneous systems (e.g. packaging a CPU, DRAM and analog together). However, cooling a 3D package with several hot chips is even more challenging than traditional systems – especially since many components such as DRAM are ultra-sensitive to leakage and thus to temperature.

Perhaps the most important takeaway is that performance and power efficiency are intimately tied to temperatures and cooling; the two cannot be discussed in isolation. This is a trend that will become more pronounced over time. Dynamic frequency adjustment techniques like Intel’s Turbo-mode heavily depend on sustaining reasonable temperatures, and 3D integration will take cooling requirements to the next level. While water cooling today is mainly in the realm of overclockers and supercomputers, who knows what the future will hold…

References

[1] Harris, D., et al. “The Fanout-of-4 Inverter Delay Metric,” unpublished manuscript, 1997, http://odin.ac.hmc.edu/~harris/research/FO4.pdf

[2] Liu, Y., et al. “Accurate Temperature-Dependent Integrated Circuit Leakage Power Estimation is Easy,” Design, Automation and Test in Europe, 2007.

[3] Okano, H., et al. “Fine Grained Power Analysis and Low-Power Techniques of a 128GFLOPS/58W SPARC64™ VIIIfx Processor for Peta-scale Computing,” 2010 Symposium on VLSI Circuits.

