By: Dan Strother (dan.strother.delete@this.gmail.com), January 1, 2021 1:12 pm
Room: Moderated Discussions
David Hess (davidwhess.delete@this.gmail.com) on January 1, 2021 12:31 pm wrote:
> Jukka Larja (roskakori2006.delete@this.gmail.com) on January 1, 2021 10:43 am wrote:
> > Gabriele Svelto (gabriele.svelto.delete@this.gmail.com) on January 1, 2021 7:10 am wrote:
> > >
> > > What does "officially" mean in this context? All non-APU
> > > Ryzen CPUs support ECC if the motherboards have the
> > > necessary traces and UEFI support. Motherboard vendors advertise this support quite clearly in the specs.
> >
> > Trying to google about how well the unofficial support works, I get lot of hits about people saying that
> > yes, it works, without any proof. I don't see people with a test DIMMs known to produce single bit errors
> > making sure the unofficial support works, or making sure it works in every CPU or at least gives some easy
> > to see error somewhere if it doesn't (I'm sure someone somewhere has tested something, but it gets lost
> > in the noise. Anecdotes are only useful if there's enough of them to be statistically significant).
> >
> > I really like what AMD is doing with CPUs, but unofficial ECC support just
> > annoys me. It's supposed to give me peace of mind and eliminate one source
> > of random problems. "Unofficial" really doesn't work great with that goal.
>
> For consumer level hardware I think the weaker link is BIOS and operating system
> support. For most users the best they can do is verify that ECC is enabled,
> and then wait weeks to months to see if any reports are generated.
>
> In the old days the memory interface was slow enough that I could have hacked together some logic and connected
> it to insert single bit errors but today that is close to impossible without a custom DIMM layout which supports
> it. I might try it with a sampling bridge operating in reverse but not on any hardware I want to keep.
>
You don't need any custom hardware to inject single bit errors (so long as you're okay with injecting a whole lot of them at once); shorting one of the DRAM data lines to ground through a suitable resistor works. I've successfully injected errors with a 100 ohm resistor to ground on my Xeon W-1290P + Supermicro X12SAE (W480 chipset, no BMC/IPMI). (I've seen people suggest shorting directly to ground without a resistor as well, but this makes me very nervous as an EE.)
If your motherboard supports overclocking, this may also be an option. I previously had an Asus W480 motherboard which supported overclocking. It was extremely fiddly, but I did eventually find a mixture of higher frequency + lower voltage that yielded correctable errors (and the occasional uncorrectable). The edge between "working reliably with zero errors" and "not booting" was surprisingly slim.
BIOS and OS support is a major problem. Neither motherboard logged the errors in any way. The Asus board was happy to keep running even after an uncorrectable error (I suspect the Supermicro would as well, but I couldn't generate sporadic uncorrectable errors on it due to lack of overclocking). Windows didn't detect any errors either (it did report that ECC was enabled, but that was it).
Linux doesn't yet support the W-1290P either, but I was at least able to hack up a modified ie31200_edac driver that appeared to correctly report errors (both CE and UE). This just involved adding a new PCI device ID to the driver. (I assumed the W-1200 series was the same as the last few 14nm rehashes, but wasn't able to confirm this because the W-1200 datasheet wasn't available at the time.)
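For anyone who wants to watch for injected errors on Linux once an EDAC driver is loaded, here is a rough sketch of reading the counters. It assumes the standard EDAC sysfs layout (`/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count`); the function name `read_edac_counts` is made up for illustration, and this is not taken from the modified driver described above.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller entries in sysfs.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce": int, "ue": int}} for each mc* directory.

    An empty dict means no EDAC memory controller was found, which usually
    means no EDAC driver is bound to the memory controller.
    """
    counts = {}
    root = Path(root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("No EDAC memory controllers found; is an EDAC driver loaded?")
    for name, c in result.items():
        print(f"{name}: correctable={c['ce']} uncorrectable={c['ue']}")
```

Running this before and after shorting a data line (or pushing the overclock) should show the correctable count climbing if the driver is actually seeing the errors. If the counts stay at zero, that alone doesn't tell you whether ECC is off or just not being reported.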
Ironically, I went with the Xeon over a Ryzen because I wanted a platform that had more "official" ECC support (and, to a lesser degree, because I really hated the idea of dealing with a chipset fan on the AMD X570). I had spent quite a while researching ECC support on the Ryzen side, and had a lot of trouble finding any concrete evidence that it actually worked (either by injecting errors through overclocking or pin-shorting). There was some evidence that it worked on older Ryzens, but not for contemporary ones.
In the end, I came away disappointed with the state of ECC on the Intel side. If I were to do it again, I'd probably wind up on the AMD side - especially now that I know how to reliably verify that ECC is really working.
- Dan