A GPU Computing Push
Nvidia’s corporate strategy firmly rests on expanding the market for GPUs beyond graphics to include certain types of computation. Specifically, Nvidia’s efforts with CUDA are aimed at moving GPUs into the high performance computing (HPC) market – where their substantial compute capabilities and memory bandwidth directly translate into performance. Nvidia’s Tesla products (GPUs designed for computation instead of graphics) have made a bit of a splash, but at the moment adoption is extremely limited. GPU clusters are basically non-existent, at least in part due to the lack of error detection and correction – a gap we believe will be corrected in the next product release from Nvidia.
HPC: CorrECCtness Required
ECC is an essential requirement for servers, and especially clusters of servers, which are the predominant compute workhorses in the HPC world. All servers come standard with ECC memory and other reliability, availability and serviceability (RAS) features (e.g. ECC on caches). Generally, these RAS features are driven from the high end by proprietary microprocessor families such as Itanium, PowerPC, SPARC and z/Architecture, with industry standard x86 servers typically lagging a bit behind. Without ECC, it’s simply not possible to build reliable clusters, since the soft error rates in DRAM are too high – and buyers are sophisticated enough to understand this.
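To see why clusters are the breaking point, a bit of back-of-envelope arithmetic helps. The sketch below is purely illustrative – the FIT rate and cluster size are hypothetical placeholders, not measured figures – but it shows how individually rare soft errors become routine events at scale:

```cuda
// Back-of-envelope reliability arithmetic for a GPU cluster.
// FIT = failures in time = expected errors per 10^9 device-hours.
// Both constants are hypothetical placeholders, chosen for illustration.
#include <cstdio>

int main() {
    double fit_per_board = 5000.0;  // assumed soft error FIT for one board's DRAM
    int boards = 1000;              // assumed cluster size
    double cluster_fit = fit_per_board * boards;
    double hours_between_errors = 1e9 / cluster_fit;
    // 5000 FIT x 1000 boards -> one silent corruption roughly every 200
    // hours, i.e. about once a week somewhere in the cluster.
    printf("Mean time between soft errors: %.0f hours\n", hours_between_errors);
    return 0;
}
```

An error rate that is negligible on a single desktop becomes a weekly (or daily) occurrence across a large machine, and without ECC every one of those errors is silent.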
Moreover, the long term trends in process technology and semiconductor device scaling make errors more prevalent in both SRAM and DRAM. Increasing density, increasing signaling rates, decreasing voltage and decreasing the amount of charge representing a single bit all push the soft error rate (SER) up, exponentially raising the risk of data corruption. Unfortunately, these are the same changes that make semiconductors cheaper and faster over time, so the resulting errors need to be adequately addressed as Nvidia moves to 40nm and eventually 28nm and beyond.
Historically, the graphics world has not been concerned with soft errors – if a pixel’s color is off by a single bit (or even multiple bits), it doesn’t really matter. Graphics applications just don’t need the same level of correctness as the rest of the system, since human eyes will compensate for many errors. As GPUs evolve to be more general purpose, however, this is one of the areas where they are following in the footsteps of CPUs and providing greater functionality. For example, GDDR5 was the first generation of graphics memory interface to include any error detection at all, but it only covers data (note that we are discussing errors on the GDDR5 bus itself, and not in the DRAM) [1]. Almost every other high speed signaling interface (>2GT/s) in the PC has comprehensive error detection and retry – PCI-Express, QuickPath (née CSI), HyperTransport, Fully Buffered DIMMs, etc. GDDR4, despite running up to 3.2GT/s, had no error detection at all.
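For the curious, the sketch below shows what link-level error detection of this sort looks like; GDDR5 reportedly computes a CRC-8 (the ATM HEC polynomial x^8 + x^2 + x + 1) per byte lane, though the framing here is simplified for illustration:

```cuda
// Simplified sketch of CRC-based error detection on a memory bus.
#include <cstdint>
#include <cstdio>

// CRC-8 with polynomial x^8 + x^2 + x + 1 (0x07), bitwise implementation.
uint8_t crc8(const uint8_t *data, int len) {
    uint8_t crc = 0;
    for (int i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07)
                               : (uint8_t)(crc << 1);
    }
    return crc;
}

int main() {
    uint8_t burst[8] = {0xDE, 0xAD, 0xBE, 0xEF, 0x01, 0x02, 0x03, 0x04};
    uint8_t sent_crc = crc8(burst, 8);  // transmitted alongside the burst

    burst[3] ^= 0x10;                   // simulate a single-bit bus error
    if (crc8(burst, 8) != sent_crc)
        printf("CRC mismatch: the memory controller would retry\n");
    return 0;
}
```

The key point is that a CRC only detects errors in transit; a bit that flips while sitting in the DRAM array is invisible to it, which is precisely where ECC comes in.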
Without ECC or other forms of error protection, the only alternative is to calculate any important value twice and compare the results to detect errors – which can halve performance (although some algorithms are naturally robust against errors). Since the main selling point for a GPU is high performance, that is quite problematic. For double precision, a GT200 or GT200b is only 65% or 88% faster than Nehalem – halve that throughput and the GPU lands at roughly 0.83x or 0.94x of the CPU, i.e. computing the results twice would make the GPU slower than a standard CPU.
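To make the cost concrete, here is a minimal CUDA sketch of that compute-twice approach – our own illustration of the general technique, not anything Nvidia ships – which runs the same kernel twice and compares the results on the host:

```cuda
// Software redundancy: run a kernel twice, compare outputs to detect
// soft errors. Detection alone costs a full second pass.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, const float *y, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *d_x, *d_y, *d_o1, *d_o2;
    cudaMalloc(&d_x, bytes);  cudaMalloc(&d_y, bytes);
    cudaMalloc(&d_o1, bytes); cudaMalloc(&d_o2, bytes);
    cudaMemset(d_x, 0, bytes); cudaMemset(d_y, 0, bytes);  // stand-in inputs

    int threads = 256, blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, 2.0f, d_x, d_y, d_o1);   // first pass
    saxpy<<<blocks, threads>>>(n, 2.0f, d_x, d_y, d_o2);   // redundant pass

    float *h1 = (float *)malloc(bytes), *h2 = (float *)malloc(bytes);
    cudaMemcpy(h1, d_o1, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h2, d_o2, bytes, cudaMemcpyDeviceToHost);

    // Any divergence between the passes indicates an error somewhere in
    // the computation or memory system; the only recourse is to rerun.
    if (memcmp(h1, h2, bytes) != 0)
        printf("Mismatch detected: rerunning computation\n");
    return 0;
}
```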
Costs and Benefits of ECC
Adding ECC support is pretty straightforward from an engineering perspective. ECC requires arrays of 9 DRAMs (instead of 8), extra pins to connect to the extra DRAMs and more logic in the memory controllers. This will noticeably increase the bill of materials for a GPU, which is not a problem for the expensive and margin-rich professional grade products such as Tesla and Quadro. However, it does mean that ECC support cannot be allowed to substantially increase costs for Nvidia’s consumer products, where ASPs and margins are decreasing under pressure from ATI’s singular focus on graphics and incipient products from Intel.
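The 9-versus-8 DRAM figure falls straight out of the arithmetic of SECDED codes; the snippet below (our own illustration) computes the required check bits from the Hamming bound:

```cuda
// Check bits for a SECDED code: the smallest k with 2^k >= m + k + 1
// corrects single-bit errors in m data bits; one more bit adds
// double-error detection.
#include <cstdio>

int secded_check_bits(int m) {
    int k = 0;
    while ((1 << k) < m + k + 1) k++;
    return k + 1;  // +1 overall parity bit turns SEC into SECDED
}

int main() {
    // A 64-bit channel needs 8 check bits: 72 bits total, i.e. one
    // extra x8 DRAM per eight, or 12.5% more memory and pins.
    printf("64-bit word: %d check bits\n", secded_check_bits(64));  // prints 8
    printf("32-bit word: %d check bits\n", secded_check_bits(32));  // prints 7
    return 0;
}
```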
Ultimately, adding ECC to Tesla products (and probably Quadro too) could be quite advantageous. As noted above, it will enable Nvidia to make more progress selling GPUs into the HPC market and reduce the hesitation of some buyers. It could also benefit Nvidia by differentiating between consumer and professional GPUs, both of which can run CUDA – after all, the price delta is enough to make buyers contemplate skipping a Tesla in favor of a GeForce. Moreover, it would differentiate Tesla and Quadro from competing products from ATI and Intel (which may not have ECC) and provide a marketing advantage.
The folks at Nvidia aren’t stupid; they have considerable expertise in HPC and understand the market requirements (in fact, they have even published papers discussing reliability issues for GPUs). But entering a new market is not done all in one step – it’s a gradual process, with each product iteration building on prior successes and feedback. Now that Nvidia has dipped their toes into the HPC waters, it’s only logical that the next step will be to add ECC memory, enabling more users to consider GPUs for their computing needs. Perhaps at the same time, they will address the previously noted limits of GDDR5’s error detection.
Above and beyond ECC memory, there are further steps that can be taken. The simplest form of ECC is Single Error Correction, Double Error Detection (SECDED), which is likely what Nvidia will implement in their next generation GPU. There are further techniques to detect and correct multi-bit errors, and even the failure of an entire DRAM. The latter is likely to be particularly important for Nvidia, given that a DRAM failure requires replacing the entire product (unlike in servers, where a faulty DIMM is easily replaced). Outside the memory system, Nvidia might consider parity or ECC for the on-chip register files and SRAM, which total around 3MB in the current generation (and will grow substantially if Nvidia ends up using real caches). But on-chip storage (or compute logic) issues are even further down the road than addressing the rather obvious reliability concerns in the memory system.
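To make SECDED concrete, here is a toy implementation of the smallest such code – a Hamming(7,4) code extended with an overall parity bit. Server ECC uses the same construction scaled up to 64 data bits and 8 check bits; the function names here are ours, purely for illustration:

```cuda
// SECDED on 4 data bits: Hamming(7,4) plus an overall parity bit.
// Bit 0 holds the overall parity; bits 1-7 are Hamming positions.
#include <cstdint>
#include <cstdio>

static int bit(uint8_t v, int i) { return (v >> i) & 1; }

uint8_t secded_encode(uint8_t d) {
    uint8_t d0 = d & 1, d1 = (d >> 1) & 1, d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
    uint8_t p1 = d0 ^ d1 ^ d3;   // covers positions 1,3,5,7
    uint8_t p2 = d0 ^ d2 ^ d3;   // covers positions 2,3,6,7
    uint8_t p4 = d1 ^ d2 ^ d3;   // covers positions 4,5,6,7
    uint8_t cw = (p1 << 1) | (p2 << 2) | (d0 << 3) |
                 (p4 << 4) | (d1 << 5) | (d2 << 6) | (d3 << 7);
    uint8_t overall = 0;
    for (int i = 1; i < 8; i++) overall ^= bit(cw, i);
    return cw | overall;         // overall parity lands in bit 0
}

// Returns 0 = clean, 1 = single-bit error corrected, 2 = double-bit
// error detected (uncorrectable).
int secded_decode(uint8_t cw, uint8_t *data_out) {
    int s = (bit(cw,1) ^ bit(cw,3) ^ bit(cw,5) ^ bit(cw,7))
          | (bit(cw,2) ^ bit(cw,3) ^ bit(cw,6) ^ bit(cw,7)) << 1
          | (bit(cw,4) ^ bit(cw,5) ^ bit(cw,6) ^ bit(cw,7)) << 2;
    int overall = 0;
    for (int i = 0; i < 8; i++) overall ^= bit(cw, i);

    int status = 0;
    if (s && overall)       { cw ^= 1 << s; status = 1; }  // flip the bad bit
    else if (s && !overall) { status = 2; }                // two bits flipped
    else if (!s && overall) { cw ^= 1; status = 1; }       // parity bit itself
    *data_out = bit(cw,3) | (bit(cw,5) << 1) | (bit(cw,6) << 2) | (bit(cw,7) << 3);
    return status;
}

int main() {
    uint8_t cw = secded_encode(0xB);  // encode 0b1011
    cw ^= 1 << 5;                     // simulate a single-bit soft error
    uint8_t d;
    int status = secded_decode(cw, &d);
    printf("status=%d, data=0x%X\n", status, d);  // status=1, data=0xB
    return 0;
}
```

Flip one more bit in the codeword and the decoder reports status 2: the error is caught but cannot be fixed, which is exactly the SECDED guarantee.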
GPUs and ECC: A Question of When
We have laid out the case for Nvidia adding ECC memory support to their GPUs, discussing the needs, costs and benefits. This is an obvious and inevitable development as Nvidia evolves their GPUs to more closely resemble CPUs and goes after the HPC market. The only real question is when Nvidia will add ECC support – and the next GPU generation is where we’d put our bets down.
[1] Qimonda GDDR5 – White Paper, August 2007. http://www.qimonda-news.com/download/Qimonda_GDDR5_whitepaper.pdf