Nvidia Transitions to SoCs
Nvidia is a company in flux. Historically the leader in PC and workstation graphics, they faced a critical choice several years ago. As AMD and Intel integrated low-end GPUs into their microprocessors, discrete graphics came under increasing pressure. Of course, as Moore’s Law progresses, integrated GPUs will become ever more capable and eventually threaten the mid-range of the graphics market. One alternative was to design (or purchase) an x86 microprocessor and compete in the PC market with an integrated offering. Instead, Nvidia chose to pivot away from the familiar PC ecosystem and enter the growing market for tablet and smartphone systems-on-chip (SoCs), using CPU cores licensed from ARM.
Their current-generation Tegra 2 is manufactured on TSMC’s 40nm triple gate oxide (LPG) process. The dual Cortex A9 cores are optimized for performance, which is one of the reasons that the Tegra 2 cores run so fast. However, this comes at a very high cost in terms of idle power. Compared to a low-power process, the leakage currents are orders of magnitude worse. The Tegra 2 cores are clock gated, but not power gated, so leakage currents are a problem even in an idle state. In contrast, almost every other smartphone SoC is manufactured on a low power (LP) process technology. The two undisputed market leaders, Qualcomm and TI, have both made numerous presentations at conferences that explain the strong preference for low leakage. Tegra 2 has met with some success for tablets, in part because of the high performance versus the competition. But unsurprisingly, the idle power is prohibitive for smartphones, and Nvidia has very few design wins or shipments because of these problems.
Recently, Nvidia revealed some rather interesting details for their upcoming Kal-El SoC. Like Tegra 2, it is manufactured on TSMC’s 40LPG and relies on the ARM Cortex A9 for the application processors. However, the similarities end there. Tegra 2 is a dual-core design with no SIMD extensions, while the high-end Kal-El is a 4+1 core design (2+1 variants are likely to appear as well) that uses the NEON vector extensions. Kal-El’s five cores are logically identical, but physically asymmetrical. The four main cores use high performance (G) transistors, which come in low and standard Vt flavors and operate at ~0.9V to reduce dynamic power consumption (low Vt transistors are faster, but much leakier). The ‘companion’ core and most of the SoC are implemented with low-leakage (LP) transistors that are only available in standard and high Vt and run at ~1.1V nominal. Generally the LP transistors have 10-100X lower leakage power, but operate about 2-3X slower. Tegra 2 was a compact 49mm2 chip and Kal-El is estimated at 80mm2. The three extra cores, a faster GPU and more powerful video acceleration account for the bulk of the extra area, as the cache and memory interfaces are largely the same.
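To put the two supply voltages in perspective, here is a rough back-of-the-envelope sketch. It assumes the textbook relation that dynamic switching energy scales with C·V², and that the switched capacitance is the same in both cases, which will not hold exactly for real G and LP transistors. The function name is purely illustrative.

```python
def relative_dynamic_energy(v: float, v_ref: float) -> float:
    """Ratio of dynamic energy per switching event at supply v versus v_ref.

    Assumes E_dyn is proportional to C * V^2 with identical capacitance C,
    which is a simplification and not a figure from Nvidia or TSMC.
    """
    return (v / v_ref) ** 2


# Main cores at ~0.9V compared with the ~1.1V nominal of the LP transistors.
ratio = relative_dynamic_energy(0.9, 1.1)
print(f"Dynamic energy per switch at 0.9V vs 1.1V: {ratio:.2f}x")
```

Under these assumptions, running at ~0.9V saves roughly a third of the dynamic energy per switching event, which is the benefit that the G transistors trade against their much higher leakage.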
Kal-El is not the first asymmetric design. TI’s OMAP4 features two Cortex A9 cores for performance, with two low-power Cortex M3 cores to off-load real-time tasks (e.g. video, display and image codecs). However, Kal-El is the first SoC to have asymmetric cores running the same software stack. The system transparently determines whether a workload is CPU insensitive (e.g. background tasks, video playback) or needs full performance (e.g. web browsing). Based on the workload analysis and OS feedback, Kal-El can switch between the two modes with ~2ms latency; data is communicated between the asymmetric cores through the shared 1MB L2 cache. There is also hysteresis control, to ensure that mode switching does not happen too frequently and hurt performance or energy use. The companion core executes low intensity workloads, while the four main cores are power gated off, reducing idle power. Demanding workloads will shut down the companion core and rely on the main cores to execute as fast as possible to reach an idle state.
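The hysteresis idea can be illustrated with a minimal sketch. Nvidia has not published the actual policy, so everything here is hypothetical: the load thresholds, the 20ms minimum dwell time and the names `ModeGovernor` and `step` are assumptions chosen only to show how a dwell timer plus separate up and down thresholds keep a bursty workload from thrashing between modes.

```python
from dataclasses import dataclass


@dataclass
class ModeGovernor:
    """Hypothetical mode-switch controller; all values are illustrative."""
    up_threshold: float = 0.80    # load above which the main cores are wanted
    down_threshold: float = 0.30  # load below which the companion core suffices
    min_dwell_ms: float = 20.0    # minimum time to stay in a mode after a switch
    mode: str = "companion"
    dwell_ms: float = 0.0
    switches: int = 0

    def step(self, load: float, dt_ms: float) -> str:
        """Advance time by dt_ms with the observed load (0.0 to 1.0)."""
        self.dwell_ms += dt_ms
        # Hysteresis in time: refuse to switch until the dwell period elapses.
        if self.dwell_ms < self.min_dwell_ms:
            return self.mode
        # Hysteresis in load: separate thresholds for going up and coming down.
        if self.mode == "companion" and load > self.up_threshold:
            self._switch("main")
        elif self.mode == "main" and load < self.down_threshold:
            self._switch("companion")
        return self.mode

    def _switch(self, new_mode: str) -> None:
        self.mode = new_mode
        self.dwell_ms = 0.0
        self.switches += 1


# A bursty load that flips between busy and idle every millisecond.
gov = ModeGovernor()
for i in range(100):
    gov.step(0.9 if i % 2 == 0 else 0.1, dt_ms=1.0)
print(f"Mode switches in 100ms of bursty load: {gov.switches}")
```

Without the dwell timer, this load pattern would request a switch nearly every millisecond; with it, the number of switches is bounded by the elapsed time divided by the dwell period.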
This technique is a novel and creative approach to reducing power, compared to the Tegra 2. In theory, this gives Kal-El the best of both worlds – good peak performance and attractive idle power. However, the implementation results remain to be seen. The 2ms switch overhead is very high, roughly 100X worse than the latency for waking up a power gated core in Intel’s Sandy Bridge or AMD’s Llano. Realistically, this means that Kal-El will stay in a given mode (standby or regular) for 20-40ms to amortize the switch overhead. As a result, many workloads with bursty characteristics will end up activating the main cores and burning more leakage power, even if they are primarily idle. However, for scenarios with long idle periods, the benefits are quite substantial. The companion core also comes at a price in die area, as the A9 is a multi-issue, out-of-order design and an extra core occupies around 2.5-4mm2. In the context of the reportedly 80mm2 Kal-El, that translates into an additional 3-5% area.
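The figures in this paragraph can be checked with some simple arithmetic. The sketch below uses only the numbers quoted above; treating the amortization as switch time divided by residency time is our own simplification, and the function names are illustrative.

```python
SWITCH_MS = 2.0   # quoted mode-switch latency
DIE_MM2 = 80.0    # estimated Kal-El die size


def switch_overhead(residency_ms: float, switch_ms: float = SWITCH_MS) -> float:
    """Switch latency as a fraction of the time spent in a mode."""
    return switch_ms / residency_ms


def area_fraction(core_mm2: float, die_mm2: float = DIE_MM2) -> float:
    """Share of the die taken up by one extra core."""
    return core_mm2 / die_mm2


for residency in (20.0, 40.0):
    print(f"{residency:.0f}ms residency: {switch_overhead(residency):.1%} switch overhead")

for core in (2.5, 4.0):
    print(f"{core}mm2 core: {area_fraction(core):.1%} of an 80mm2 die")

# A 100X lower wake-up latency implies roughly 20 microseconds.
print(f"Implied power gate wake-up latency: {SWITCH_MS * 1000 / 100:.0f}us")
```

Staying in a mode for 20-40ms keeps the switch cost at about 5-10% of the residency time, and a 2.5-4mm2 core works out to roughly 3-5% of the die, consistent with the estimate above.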
As an aside, the triple gate oxide process increases the complexity and cost of both Tegra 2 and Kal-El. The LPG process is more flexible and gives additional performance versus a standard LP, but it adds several mask layers to form the extra gate oxide (for the extra high performance logic transistors). Actually forming the second logic gate oxide will also heat up the previously formed gates and have a minor yield impact. Relatively few companies use TSMC’s LPG option, and we estimate that it adds about 5-7% to Nvidia’s costs when taking the volumes into account, versus a traditional LP node.