One of the big differentiators between servers appears to be RAS features. The Opteron is notably light there. Does HORUS bring anything to the table that improves the reliability and availability of Opteron-based servers? Does the onboard memory controller hamper this?
We focused extensively on RAS from the start of HORUS, and it improves reliability and availability significantly. We don’t have to “work around” the Opteron to increase the reliability or availability of the system. With HORUS and the system management software running on our service processor, we have many RAS features built in and supported.
Regarding reliability, can you state what fraction of non-scan latches & flip-flops are protected by ECC or parity?
All arrays are ECC protected, with single-bit correction and double-bit detection. Our crossbar buses are not parity or ECC protected, but we have error checking that detects invalid encodings and sends machine check events to the service processor.
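For readers unfamiliar with the term, single-bit correction with double-bit detection (SEC-DED) can be illustrated with a toy extended Hamming code. The sketch below protects a 4-bit value with 4 check bits; the word width, bit layout, and the function names `encode` and `decode` are assumptions chosen for illustration, not details of the HORUS arrays, which the interview does not specify.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SEC-DED) codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions, with check bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the parity of all 8 bits even
    return code


def decode(code):
    """Return (nibble, status); nibble is None if the error is uncorrectable."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits; it is 0 for
    # a valid codeword and equals the position of a single flipped bit.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2  # 1 means an odd number of bits flipped

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Single-bit error; syndrome 0 means the overall parity bit itself.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

Flipping any one bit of an encoded word still decodes to the original value with status "corrected", while flipping any two bits is reported as "uncorrectable" rather than silently returning wrong data. In a real system that second case is the kind of event that would be escalated, for example as a machine check.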
Are there any plans to incorporate lock-stepping, voting or other fault-tolerant features in future versions of HORUS?
No such plans for HORUS currently.
Editor’s Note: Pages 28 and 29 of the HORUS presentation at Hot Chips give more information on the RAS features in HORUS, such as the service processor and error recoverability.