22nm NTV SIMD Permute
A second paper from Intel Labs described a 22nm 256-bit SIMD permute unit designed for NTV. The unit operates from 0.28V to 1.1V and runs at 17MHz to 2.5GHz. The key components are a large 32-entry, 256-bit wide register file and a crossbar for moving data. The register file has three read ports and one write port, and can perform vertical shuffles across multiple entries, bypassing the crossbar entirely. The crossbar is capable of byte-wise any-to-any permutes, which matches and exceeds the requirements of Intel’s SSE and AVX instructions. As is to be expected, the most significant adaptations were needed for the memory, rather than the logic; performance is primarily limited by the register file, not the crossbar.
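Behaviorally, a byte-wise any-to-any permute means each output byte can select any byte of the 256-bit (32-byte) input. A minimal software sketch of that operation (not Intel's hardware, just the semantics) looks like:

```python
# Behavioral model of a byte-wise any-to-any permute over a 256-bit
# (32-byte) vector: output byte i is whichever input byte ctrl[i] selects.
def permute_bytes(src: bytes, ctrl: list[int]) -> bytes:
    """Return a 32-byte vector where output byte i is src[ctrl[i] & 31]."""
    assert len(src) == 32 and len(ctrl) == 32
    return bytes(src[c & 31] for c in ctrl)

# Reversing the byte order is one of the many possible permutes.
data = bytes(range(32))
rev = permute_bytes(data, list(range(31, -1, -1)))
```

Any shuffle expressible as a per-byte index vector (broadcasts, rotations, interleaves) is a special case of this single primitive, which is why one crossbar can cover the SSE and AVX shuffle instructions.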
The team from Intel highlighted three techniques, which collectively reduced the operating voltage of the logic by 0.15V. The most significant was enhancing multiple flip-flops with shared circuitry to tolerate variation. Special level shifters enable the crossbar logic to operate efficiently at a different voltage from the flip-flops. The last technique added shared redundant transistors to several circuits, which counteract variation.
The register file in the permute unit was heavily modified to run at NTV, with attention paid to both reading and writing the register file cells. The register file read path was converted from dynamic to static circuits, which netted a substantial 0.2V saving. This is a huge change in design; nearly all register files and SRAMs use dynamic circuits. However, dynamic circuits rely on charge stored on a node, and do not function well at the low frequencies and voltages of NTV. In a way, this mirrors Intel’s choice to convert all the datapath logic in Nehalem to static circuits, and is a total reversal from the dynamic circuit techniques used in the Pentium 4.
To reduce the write voltage as well, two techniques were used in conjunction. The first was changing the write circuits to a topology with redundant transistors, which is substantially more robust. The second was stabilizing the supply voltage into the register file cells. Overall, the write voltage for the register file decreased by 0.25V.
The measured results show significant gains in energy efficiency at a nominal 0.9V and 1.8GHz, thanks to a highly efficient architecture. The register file performs basic blend operations in a single cycle, reducing energy by 66%. A more complex 64-bit 4×4 matrix transpose drops from 12 to 7 cycles and reduces energy by 53%. Moving data from an array-of-structs to a struct-of-arrays showed more modest benefits, taking 6 cycles instead of 8 and using 40% less energy. Additionally, the energy gains from near-threshold operation are nearly an order of magnitude compared to nominal voltages, as shown in Figure 4.
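For reference, the transpose and AoS-to-SoA kernels measured above amount to the following data movements, shown here as plain-Python illustrations rather than the actual SIMD instruction sequences:

```python
# Plain-Python illustrations of the two measured kernels; the hardware
# performs these as register-file shuffles and crossbar permutes.

def transpose4x4(m):
    """4x4 matrix transpose: output row r is column r of the input."""
    return [[m[c][r] for c in range(4)] for r in range(4)]

def aos_to_soa(records, nfields):
    """Array-of-structs -> struct-of-arrays: gather field i from every record."""
    return [[rec[i] for rec in records] for i in range(nfields)]

m = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
t = transpose4x4(m)          # first row becomes [1, 5, 9, 13]
```

Both kernels are pure data rearrangement with no arithmetic, which is why the register file's built-in shuffle capability, rather than the crossbar, dominates their cycle counts and energy.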
32nm Variable Precision FPU
Intel presented another paper that was not directly related to NTV, but is indirectly quite interesting because it hints at the most likely applications. The paper describes a 32nm variable precision floating point unit that executes a fused multiply-add (FMA) every cycle, with three cycles of latency.
The FPU can operate in three different modes, trading off performance, energy efficiency, and accuracy. All modes use a normal 8-bit exponent, but vary the size of the mantissa based on the necessary precision. The first is a standard IEEE 754 single precision mode that executes one FMA per clock with a 24-bit mantissa. The second executes two FMAs with 12-bit mantissas, and the most efficient mode executes four FMAs with 6-bit mantissas. The FPU also has new certainty tracking circuits that determine when greater accuracy is necessary for a correct calculation. The certainty tracking and extra exponent calculations increase power by roughly 21% compared to single precision.
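The effect of the three mantissa widths can be modeled in software by rounding results to n significant bits. The helper below is a hypothetical illustration, not part of the paper:

```python
import math

def round_to_mantissa(x: float, bits: int) -> float:
    """Round x to a value representable with a `bits`-bit mantissa
    (counting the implicit leading 1); the exponent is untouched."""
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(x)          # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** bits
    return math.ldexp(round(m * scale) / scale, e)

# One FMA at each of the unit's precisions: 24-, 12-, or 6-bit mantissa.
def fma(a: float, b: float, c: float, bits: int) -> float:
    return round_to_mantissa(a * b + c, bits)
```

For example, 0.75 fits exactly in a 6-bit mantissa, while 1 + 1/64 does not and rounds away; the narrower the mantissa, the more values collapse together, which is the accuracy cost of the faster modes.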
As shown in Figure 5, the energy efficiency increases by about 1.5× for two-wide execution and over 3× for four-wide operations. The efficiency also depends on the operating conditions, with about 7× greater savings at low voltage (although not quite near-threshold levels).
To improve performance and energy efficiency, the FPU speculatively computes results using reduced precision. The initial attempt calculates four operations with 6-bit mantissas. If more accuracy is required, a second attempt uses two 12-bit mantissas, before finally falling back to standard single precision.
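A software model of this retry policy might look like the sketch below. The relative-error check stands in for the paper's certainty tracking circuits, whose details are not described here, and `round_mantissa` is a hypothetical helper for reduced precision:

```python
import math

def round_mantissa(x: float, bits: int) -> float:
    """Round x to a `bits`-bit mantissa (stand-in for reduced precision)."""
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(x)               # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 2**bits) / 2**bits, e)

def speculative_fma(a, b, c, tolerance=1e-3):
    """Try a 6-bit mantissa first, then 12, then 24, as in the retry scheme.

    Returns (result, mantissa_bits_used). The error check is a stand-in
    for the hardware's certainty tracking."""
    exact = a * b + c                  # reference value for the error check
    for bits in (6, 12, 24):
        approx = round_mantissa(exact, bits)
        if bits == 24 or abs(approx - exact) <= tolerance * abs(exact or 1.0):
            return approx, bits
```

In this model, data that is well-behaved at 6 bits never pays for the wider retries, while worst-case data runs all three attempts, which matches the energy behavior described below.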
The total energy savings strongly depend on the data involved. Successfully using 6-bit mantissas saves considerable energy. Calculations that must be recomputed with 12-bit mantissas still save about 7% energy compared to single precision. However, operations that only succeed in single precision use roughly twice as much energy due to the retries. In all likelihood, a real implementation would need more intelligent speculation to avoid wasting power. Additionally, the variable precision FPU did not address double precision (i.e. a 53-bit mantissa), but it is easy to imagine extending the technique.