Inside Fermi: Nvidia’s HPC Push


Reliability

As we predicted previously, Fermi will have optional ECC support to protect data stored in both DDR3 and GDDR5 memory, and the on-chip SRAM arrays are protected as well.

ECC for DDR3 is unsurprising, as DDR3 was designed for ECC from the start. However, ECC for graphics memory is much more interesting, as Nvidia’s engineers had to go above and beyond the GDDR5 specification to achieve that level of protection. Unfortunately, Nvidia did not disclose the algorithms and techniques used for ECC. However, they did say that when using ECC they observed an application performance drop of 5-20%, with some applications suffering even more. This probably corresponds to a drop of 25-30% in memory bandwidth. It will be interesting to see the actual mechanisms and how much bandwidth and capacity they really cost, especially if they did something particularly novel.
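To put a rough number on the capacity side of that question, the sketch below counts the check bits a textbook single error correct, double error detect (SECDED) Hamming code would need for various word widths. This assumes a conventional Hamming-style code, which Nvidia has not confirmed for the DRAM path; the function names are ours and purely illustrative.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed by a Hamming SECDED code over `data_bits` data bits."""
    r = 0
    # Single-error correction needs 2^r >= data_bits + r + 1, so that every
    # possible single-bit error position (plus "no error") has its own syndrome.
    while (1 << r) < data_bits + r + 1:
        r += 1
    # One extra overall-parity bit upgrades the code to double-error detection.
    return r + 1


def capacity_overhead(data_bits: int) -> float:
    """Fraction of extra storage relative to the protected data."""
    return secded_check_bits(data_bits) / data_bits


if __name__ == "__main__":
    for width in (32, 64, 128, 256):
        bits = secded_check_bits(width)
        print(f"{width:4d} data bits -> {bits:2d} check bits "
              f"({capacity_overhead(width):.1%} overhead)")
```

For 64-bit words this gives 8 check bits, the 12.5% overhead familiar from 72-bit ECC DIMMs. Wider protected words reduce the storage overhead, but any write smaller than the protected word then becomes a read-modify-write.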

Using ECC helps protect the data stored in DRAMs, but does not necessarily protect the commands, addresses and data as they are sent from the DRAM to the memory controller (and vice versa) – that is the role of the memory interface specification. GDDR5 optionally has CRC and retry protection for the data transmission lines that run between the memory controllers and the DRAMs, which Nvidia will surely use. However, the GDDR5 specification does not have any support for protecting the command and addressing lines [5].
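The general idea of CRC-plus-retry on a data link can be sketched in a few lines. This is a simplified software model, not the GDDR5 signalling protocol: the polynomial (0x07), the one-bit-flip channel model, and all function names are assumptions chosen for illustration.

```python
import random

CRC8_POLY = 0x07  # x^8 + x^2 + x + 1; an illustrative choice of polynomial


def crc8(data: bytes) -> int:
    """Bitwise CRC-8 (MSB first, initial value 0, no final XOR)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ CRC8_POLY) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def noisy_link(burst: bytes, flip_probability: float, rng: random.Random) -> bytes:
    """Hypothetical channel: flips one random bit with the given probability."""
    if rng.random() >= flip_probability:
        return burst
    corrupted = bytearray(burst)
    bit = rng.randrange(len(corrupted) * 8)
    corrupted[bit // 8] ^= 1 << (bit % 8)
    return bytes(corrupted)


def transfer_with_retry(burst: bytes, flip_probability: float,
                        rng: random.Random, max_attempts: int = 8):
    """Resend the burst until the receiver's CRC matches the sender's."""
    expected = crc8(burst)  # checksum computed on the sending side
    for attempt in range(1, max_attempts + 1):
        received = noisy_link(burst, flip_probability, rng)
        if crc8(received) == expected:
            return received, attempt
    raise RuntimeError("link still failing after retries")


if __name__ == "__main__":
    data, attempts = transfer_with_retry(bytes(range(8)), 0.3, random.Random(1))
    print(f"burst delivered after {attempts} attempt(s)")
```

Note what this does and does not cover: the checksum catches corruption of the data in flight, but it says nothing about whether the burst was read from or written to the intended address.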

So while the data transmission can be protected, there is no guarantee that the data is actually coming from the right location. Nvidia claims to have some sort of protection for the command and addressing lines in GDDR5, which is possible, but seems a little unlikely without an explanation of exactly how they achieve that. For instance, each of the command and address lines could be replicated on the controller and board and then checked against each other before terminating at the DRAM, but this would only detect a very limited class of errors. This is not to say that Nvidia’s claims are unreasonable, but at this point, skepticism is merited until the actual methods are disclosed.
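A toy model shows why simple replication catches so little. It covers only the hypothetical duplicate-and-compare scheme speculated about above, not anything Nvidia has described; the function name and fault masks are invented for illustration.

```python
def duplicated_lines_detect(address: int, fault_a: int, fault_b: int) -> bool:
    """Return True if comparing two copies of the address lines flags an error.

    `fault_a` and `fault_b` are bit masks of the lines flipped on each copy.
    """
    copy_a = address ^ fault_a
    copy_b = address ^ fault_b
    return copy_a != copy_b


address = 0b1011_0010_1100

# A glitch on one copy only is caught, since the two copies disagree.
print(duplicated_lines_detect(address, fault_a=1 << 3, fault_b=0))

# A common-mode fault (for example, noise coupled onto both copies) flips the
# same line on each, so the copies still agree and the wrong address passes.
print(duplicated_lines_detect(address, fault_a=1 << 3, fault_b=1 << 3))

# An address that was already wrong before it was replicated is also invisible.
wrong_address = address ^ (1 << 5)
print(duplicated_lines_detect(wrong_address, fault_a=0, fault_b=0))
```

Only faults that affect the two copies differently are visible to the comparison, and anything that goes wrong after the comparison point is outside its reach entirely.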

Astute readers will note that DDR3 does not have any such protection whatsoever. However, this is not the grievous deficiency it seems to be, as the data lines are running substantially slower (a maximum of 2Gbps, compared to a minimum of 3.6Gbps for GDDR5), so the chance of transmission errors is far lower.

Nvidia’s engineers also decided to protect the 3.75MB of on-chip SRAM arrays with ECC, using a standard single error correct and double error detect (SECDED) algorithm. This encompasses the 16x128KB register files, the 16x64KB L1D caches and the 768KB L2 cache. The L1I cache may not be protected, as it is read only.
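For readers unfamiliar with SECDED, here is a minimal sketch using an extended Hamming(8,4) code: 4 data bits, 3 Hamming parity bits and 1 overall parity bit. The tiny word size is only for readability; real hardware protects much wider words, and the exact code and word width used in Fermi's SRAMs are not disclosed. The function names are ours.

```python
DATA_POSITIONS = (3, 5, 6, 7)   # Hamming positions that carry data bits
PARITY_POSITIONS = (1, 2, 4)    # power-of-two positions carry parity


def encode(nibble: int) -> list:
    """Encode 4 data bits as an 8-bit codeword (index 0 = overall parity)."""
    code = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (nibble >> i) & 1
    for p in PARITY_POSITIONS:
        # Parity bit p makes the positions whose index has bit p set sum to even.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2  # overall parity over the other seven bits
    return code


def decode(received: list):
    """Return (nibble, status); nibble is None on an uncorrectable error."""
    code = list(received)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i          # XOR of set positions; 0 for a valid word
    overall_odd = sum(code) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd number of flips: treat as a single error at index `syndrome`
        # (a syndrome of 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "double-error"
    nibble = sum(code[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1010)
    word[5] ^= 1                   # inject a single-bit error
    print(decode(word))
    word[2] ^= 1                   # inject a second error
    print(decode(word))
```

A single flipped bit is located by the syndrome and repaired, while two flipped bits are reported but cannot be repaired; three or more may be silently miscorrected, which is one reason scrubbing matters for larger arrays.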

For now, more advanced optimizations, such as tolerating DRAM failures or proactive scrubbing of memory and SRAM, are not implemented. But high-end CPUs such as those from IBM and Intel have already trod quite far down this path, leaving a clear roadmap should Nvidia and others choose to follow.
