By: Ian Cutress (ian.delete@this.anandtech.com), January 3, 2021 12:09 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on January 2, 2021 12:21 pm wrote:
> Jukka Larja (roskakori2006.delete@this.gmail.com) on January 1, 2021 10:28 pm wrote:
> >
> > So yeah, I do very much agree AMD has superior offering. ECC doesn't really matter here though.
>
> ECC absolutely matters.
>
> Yes, I'm pissed off about it. You can find me complaining about this literally for decades
> now. I don't want to say "I was right". I want this fixed, and I want ECC.
>
> And AMD did it. Intel didn't.
>
> > I don't really see AMD's unofficial ECC support being a big deal.
>
> I disagree. The difference between "the market for working memory actually exists" and "screw
> consumers over by selling them subtly unreliable hardware" is an absolutely enormous one.
>
> And the fact that it's "unofficial" for AMD doesn't matter. It works. And it allows
> the markets to - admittedly probably very slowly - start fixing themselves.
>
> Linus
I want to add something in here, just from what we've heard from our users.
Even if you have a consumer CPU with ECC memory installed, and the motherboard reports that ECC is enabled, ECC might not actually be enabled. Even software that states that ECC is enabled is simply reading the motherboard register - the only way to confirm is to run a test that forces an ECC correction and to monitor for it. This means that a chunk of people who think they have ECC working on a system do not. Finding the right combination of motherboard, motherboard BIOS/firmware, and memory is somewhat confusing because people are reporting that 'ECC is enabled' when it's only being reported as such by the motherboard and not actually tested.
This mostly comes down to the fact that it's 'unofficial' for AMD. It's not part of the POR (plan of record), and it's not qualified at every stage of the CPU/motherboard design. Vendors don't have to check if it's actually working for consumer-grade CPUs, so whatever gets reported doesn't matter, because it's not part of the validation checks.
This is why official support matters. At the moment, AMD systems unofficially supporting ECC is a quagmire of 'system reporting as ECC enabled' vs ECC actually being enabled, tested for, and working. It's a step in the right direction, sure, but end-users wanting this feature might not be protected at all, and may end up spending extra for explicit support.
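As a rough illustration of the 'monitor for corrections' part, here is a minimal Python sketch. It assumes a Linux system with an EDAC driver loaded, which exposes per-memory-controller corrected/uncorrected error counters under /sys/devices/system/edac/mc. The function name and the idea of diffing counters before and after a test are my own illustration, not anything a vendor provides. Note that a count of zero proves nothing by itself: only a counter that goes up after a deliberately induced error shows that ECC is doing its job.

```python
from pathlib import Path

# Assumed location of the Linux EDAC memory-controller counters.
DEFAULT_EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_edac_counts(edac_root=DEFAULT_EDAC_ROOT):
    """Return {controller_name: (corrected, uncorrected)} for each mc* directory.

    Returns an empty dict when no EDAC driver is loaded (directory missing),
    which itself is a hint that ECC reporting is not active on this system.
    """
    root = Path(edac_root)
    counts = {}
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        ce_file = mc / "ce_count"  # corrected errors
        ue_file = mc / "ue_count"  # uncorrected errors
        if ce_file.is_file() and ue_file.is_file():
            counts[mc.name] = (
                int(ce_file.read_text().strip()),
                int(ue_file.read_text().strip()),
            )
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("No EDAC memory controllers found; ECC reporting looks inactive.")
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Run it once before and once after a test that is expected to force a correction, and compare the corrected counts.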
---
The analogy I always like to bring up for ECC in regular use: imagine you have a theoretical system that is affected by one bit error per year for every gigabyte of memory you have, i.e. 1 E/GB/yr.
For a system with 128 GB, that means 128 E/yr, or one soft error roughly every 2.85 days - slightly more often than once every three days. You have to hope that error falls in memory you're not using. As systems get more memory, steps need to be taken to protect against soft errors.
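To make the arithmetic concrete, here is a small Python sketch of the same back-of-the-envelope calculation. The 1 E/GB/yr figure is the theoretical rate from the analogy, not a measured one, and the function name and the other memory sizes are just for illustration.

```python
def days_between_errors(memory_gb, errors_per_gb_per_year=1.0, days_per_year=365.0):
    """Average number of days between soft errors for a given memory size."""
    errors_per_year = memory_gb * errors_per_gb_per_year
    return days_per_year / errors_per_year


if __name__ == "__main__":
    # Illustrative memory sizes; 128 GB is the example from the text.
    for gb in (16, 32, 128, 1024):
        print(f"{gb:5d} GB: one error every {days_between_errors(gb):.2f} days")
```

At 128 GB this gives 365 / 128, about 2.85 days between errors.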
Real memory error rates are well below 1 E/GB/yr, and even 1 E/GB/yr is still a crazily low error rate if you think about it. In non-standard environments (high thermals, etc), the error rates could be that high. I take it as a rule of thumb at this point for any system build.
Obviously malicious errors are somewhat different.
It is also worth noting that memory error rates are typically given for the bin of the memory used. If your memory is overclocked, or the CPU is overclocked, that matters. It also matters what you're putting in the system: AMD Ryzen 3000/5000 CPUs are rated for DDR4-3200, but if you put in DDR4-3600 (or faster memory), above official support, then on paper the error rates are likely to increase. Vendors obviously still try to match the requirements for those speeds and keep error rates low, and it all comes down to board-to-board and chip-to-chip design. If we go ECC across the board, then consumer enthusiast non-ECC memory as a market will disappear.
My $0.02.