Near-Threshold Voltage Analysis
Individually, each of the three papers shows intriguing results for improving energy efficiency. Beyond academic curiosity though, the real question is: where will near-threshold voltage be adopted in real products? Looking at all three papers collectively, there are hints about where near-threshold voltage is a good fit.
The NTV results present a consistent set of trade-offs. Compared to conventional designs, NTV techniques enable dramatically lower voltages and substantially better energy efficiency. At near-threshold voltages, the efficiency improves by around 4-7×. In the modern power-constrained world, using the available power much more effectively translates into substantially higher performance.
However, there are significant costs. NTV circuits use substantially more area and transistors than conventional designs. The 0.6µm Pentium used 3.3 million transistors. In contrast, the 32nm NTV Pentium was implemented with 6 million transistors. The 80% increase in transistor count was largely driven by NTV circuit techniques and it is safe to assume that the area would increase as well (relative to a ‘normal’ implementation of the Pentium core in 32nm).
Moreover, the optimal operating point for NTV designs tends to be quite low. The 32nm Pentium core increased efficiency by about 5× by running at slightly under 100MHz. The maximum frequency was 915MHz, so the absolute performance decreased by about an order of magnitude. That is a tremendous sacrifice to achieve energy efficiency, and one that may not be feasible for many applications.
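The arithmetic behind this trade-off can be sketched in a few lines of Python. This is only a back-of-the-envelope model: it assumes that performance scales linearly with clock frequency and that work parallelizes perfectly across cores, neither of which comes from the papers. The function name `ntv_tradeoff` and the constants are illustrative, using the rounded figures quoted above.

```python
# Back-of-the-envelope model of the 32nm NTV Pentium trade-off.
# Assumption (not from the papers): performance scales linearly with frequency.

F_MAX_MHZ = 915.0      # maximum frequency quoted in the text
F_NTV_MHZ = 100.0      # approximate NTV operating point ("slightly under 100MHz")
EFFICIENCY_GAIN = 5.0  # approximate energy-efficiency improvement at that point


def ntv_tradeoff(f_max=F_MAX_MHZ, f_ntv=F_NTV_MHZ, eff_gain=EFFICIENCY_GAIN):
    """Return (slowdown, power_reduction, iso_power_throughput_gain)."""
    # Per-core slowdown from running at the lower frequency.
    slowdown = f_max / f_ntv
    # Efficiency = performance / power, so power falls by slowdown * eff_gain.
    power_reduction = slowdown * eff_gain
    # Assumption: with a fixed power budget and perfectly parallel work, the
    # saved power buys `power_reduction` cores, each 1/slowdown as fast.
    iso_power_throughput_gain = power_reduction / slowdown
    return slowdown, power_reduction, iso_power_throughput_gain


if __name__ == "__main__":
    slowdown, power_reduction, throughput_gain = ntv_tradeoff()
    print(f"per-core slowdown:        {slowdown:.2f}x")
    print(f"per-core power reduction: {power_reduction:.2f}x")
    print(f"iso-power throughput:     {throughput_gain:.2f}x")
```

Under these assumptions, each core is roughly 9× slower but draws roughly 46× less power, so a fixed power budget yields about 5× the aggregate throughput. That gain is only available to workloads that can actually use many slow cores.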
As an additional observation, the 22nm NTV vector permute unit and the 32nm variable precision FPU papers were partially funded by a grant from the US government. This suggests that the two techniques are aimed at a single larger goal. Presumably, this grant is related to research programs on energy efficient computing, or perhaps the Exascale program.
The variable precision FPU also has a particular set of trade-offs. Speculatively reducing precision can save substantial energy by improving throughput, while keeping power roughly constant. However, it is unlikely to be quite as useful in the context of software that relies on scalar computations. In that case, reducing precision should only modestly reduce power consumption. So the ideal workload would be amenable to vectorization.
Moreover, retrying uncertain calculations introduces an element of unpredictability. In the best case, the FPU can perform 4 FMAs/cycle. However, calculations requiring single precision would proceed at a single FMA/cycle, while wasting at least 1 cycle on the initial attempt. So the latency for 4 FMAs could vary from 3 cycles to 13, and perhaps more depending on the implementation and uncertainty detection. On an in-order processor core, this could potentially stall the pipeline while recalculating at higher precision. It would also significantly complicate code scheduling for compilers and programmers. However, an out-of-order processor core can easily inject the retry operations into the pipeline without stalling other instructions, minimizing the impact.
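To get a feel for how the retry rate affects sustained throughput, here is a toy model in Python. It is a sketch under stated assumptions, not a description of the actual FPU: it assumes work is issued in batches of 4 FMAs, that a speculative attempt occupies 1 issue cycle, and that a batch needing single precision is replayed in full at 1 FMA/cycle. It models issue throughput only. The cycle latencies quoted above depend on pipeline depth and uncertainty detection, which are not modeled here. All names and constants are hypothetical.

```python
# Toy throughput model for a speculative variable-precision FPU.
# Assumptions (illustrative, not from the paper): batches of 4 FMAs, a 1-cycle
# speculative attempt, and a full replay of the batch at 1 FMA/cycle on retry.

BATCH = 4           # FMAs per low-precision issue cycle (best case in the text)
ATTEMPT_CYCLES = 1  # issue cycles spent on the speculative attempt
RETRY_CYCLES = 4    # issue cycles to redo the batch at 1 FMA/cycle


def expected_fmas_per_cycle(p_retry):
    """Expected sustained FMAs per cycle when a fraction p_retry of batches retry."""
    if not 0.0 <= p_retry <= 1.0:
        raise ValueError("p_retry must be between 0 and 1")
    expected_cycles = ATTEMPT_CYCLES + p_retry * RETRY_CYCLES
    return BATCH / expected_cycles


def break_even_retry_rate():
    """Retry rate at which speculation only matches a plain 1 FMA/cycle unit."""
    # Solve BATCH / (ATTEMPT_CYCLES + p * RETRY_CYCLES) == 1 for p.
    return (BATCH - ATTEMPT_CYCLES) / RETRY_CYCLES


if __name__ == "__main__":
    for p in (0.0, 0.1, 0.25, 0.5, 1.0):
        print(f"retry rate {p:.2f}: {expected_fmas_per_cycle(p):.2f} FMAs/cycle")
    print(f"break-even retry rate: {break_even_retry_rate():.2f}")
```

In this simplified model, speculation pays off as long as fewer than three quarters of the batches need to be retried. A real design that retries individual operations rather than whole batches would behave differently.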
Future Near-Threshold Voltage Applications
The trade-offs associated with near-threshold voltage techniques strongly suggest certain applications. General purpose CPUs for client systems are unlikely to benefit from NTV. While energy efficiency is important, sacrificing frequency is inconsistent with the overall design targets. Client workloads are very bursty, with a small number of programs executing at any one time. The responsiveness of the system is largely determined by latency, and performance does not always improve with throughput. Modern CPUs run at high frequencies (2-4GHz) because software is inherently unpredictable. Increasing frequency improves performance for nearly all workloads, including those with many branches and unpredictable memory references. Sacrificing frequency for higher throughput would simply decrease performance in many cases. Moreover, client systems are very cost sensitive and the extra area is simply not worth the minimal benefits.
General server processors are in a very similar situation, because many applications need high single threaded performance. However, massively parallel workloads, like those targeted by low power ‘cloud servers’ (e.g. SeaMicro, Calxeda), would be a reasonable fit. Massive server farms are often power constrained, and if single threaded performance is not critical, then the gain in energy efficiency is likely to be worthwhile. Additionally, server CPUs are expensive and high margin, so it is reasonable to spend die area to increase energy efficiency.
The strongest matches for NTV are graphics-like workloads and high performance computing (HPC). Graphics processors (as well as image processors and DSPs commonly found in tablets and smartphones) are inherently throughput focused, with minimal emphasis on each individual thread. Modern GPUs run at fairly low frequencies, from roughly 200MHz for mobile graphics to 1.5GHz for high-end solutions. Nearly all GPUs are power limited, so improving efficiency would directly translate into performance. Graphics workloads are inherently floating point heavy and easy to vectorize, with a minimal emphasis on accuracy. The variable precision FPU would be a good fit here as well; errors caused by reduced precision may not even be perceptible to consumers. Die area is still constrained though, as GPUs are primarily useful for client systems and cost is a major factor. However, smartphones and tablets are so energy sensitive that the costs could be outweighed by greater battery life.
The best fit is the world of HPC, which combines the favorable qualities of the graphics and server market. As the Exascale project has highlighted, power is the most significant constraint for HPC systems. Like GPUs, workloads are very parallel and performance is largely measured in terms of throughput. Reducing frequency to improve efficiency is an excellent choice. Accuracy is crucial for HPC, and a variable precision FPU fits well. However, a more advanced design would be needed, since HPC has significant double precision requirements. Unlike client GPUs, HPC accelerators are fairly expensive. Each accelerator can typically perform the work of 2-4 server CPUs and costs several thousand dollars, while using 400mm² or more. There is no question that a higher performance design would be able to sustain higher prices and pay for the additional die area. Perhaps most telling, US government grants typically focus on areas of national interest. Graphics simply is not vital to the country, whereas HPC is a critical tool for the Departments of Defense, Energy, and any number of intelligence agencies.
In summary, Intel’s work on near-threshold voltage is a novel approach that demonstrates substantially greater throughput and energy efficiency, at the cost of frequency and die area. The nature of NTV is uniquely suited to HPC workloads, which will significantly benefit from the energy advantages and can afford the associated costs. It is likely that the first commercial implementations will appear in Intel’s HPC-oriented products, such as Knights Landing or successor projects. Other potential targets include Intel’s integrated GPUs for PCs, tablets and smartphones as well as SoC components such as DSPs and image processors.