By: Tim McCaffrey (timcaffrey.delete@this.aol.com), January 3, 2021 9:28 am
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on January 2, 2021 12:21 pm wrote:
> Jukka Larja (roskakori2006.delete@this.gmail.com) on January 1, 2021 10:28 pm wrote:
> >
> > So yeah, I do very much agree AMD has superior offering. ECC doesn't really matter here though.
>
> ECC absolutely matters.
>
> ECC availability matters a lot - exactly because Intel has been instrumental in
> killing the whole ECC industry with its horribly bad market segmentation.
>
> Go out and search for ECC DIMMs - it's really hard to find. Yes - probably entirely thanks
> to AMD - it may have gotten slightly better lately, but that's exactly my point.
>
> Intel has been detrimental to the whole industry and to users because
> of their bad and misguided policies wrt ECC. Seriously.
>
> And if you don't believe me, then just look at multiple generations of rowhammer, where each
> time Intel and memory manufacturers bleated about how it's going to be fixed next time.
>
> Narrator: "No it wasn't".
>
> And yes, that was - again - entirely about the misguided and arse-backwards policy
> of "consumers don't need ECC", which made the market for ECC memory go away.
>
> The arguments against ECC were always complete and utter garbage. Now even the memory manufacturers are
> starting to do ECC internally because they finally owned up to the fact that they absolutely have to.
>
> And the memory manufacturers claim it's because of economics and lower power. And they are
> lying bastards - let me once again point to row-hammer about how those problems have existed
> for several generations already, but these f*ckers happily sold broken hardware to consumers
> and claimed it was an "attack", when it always was "we're cutting corners".
>
> How many times has a row-hammer like bit-flip happened just by pure bad luck on real
> non-attack loads? We will never know. Because Intel was pushing shit to consumers.
>
> And I absolutely guarantee they happened. The "modern DRAM is so reliable that it doesn't need ECC"
> was always a bedtime story for children that had been dropped on their heads a bit too many times.
>
> We have decades of odd random kernel oopses that could never be explained and were likely due to
> bad memory. And if it causes a kernel oops, I can guarantee that there are several orders of magnitude
> more cases where it just caused a bit-flip that just never ended up being so critical.
>
> Yes, I'm pissed off about it. You can find me complaining about this literally for decades
> now. I don't want to say "I was right". I want this fixed, and I want ECC.
>
> And AMD did it. Intel didn't.
>
> > I don't really see AMD's unofficial ECC support being a big deal.
>
> I disagree. The difference between "the market for working memory actually exists" and "screw
> consumers over by selling them subtly unreliable hardware" is an absolutely enormous one.
>
> And the fact that it's "unofficial" for AMD doesn't matter. It works. And it allows
> the markets to - admittedly probably very slowly - start fixing themselves.
>
> But I blame Intel, because they were the big fish in the pond, and they were the
> ones that caused the ECC market to basically implode over a couple of decades.
>
> ECC DRAM (or just parity) used to be standard and easily accessible back when. ECC
> and parity aren't a new thing. It was literally killed by bad Intel policies.
>
> And don't let people tell you that DRAM got so reliable that it
> wasn't needed. That was never ever really true. See above.
>
> Linus
Parity existed on every motherboard until EDO DRAM was introduced.
I think a few factors were involved:
1) EDO DRAM, IIRC, was produced only in x8 packages; there was no x1 (or x9) package.
(This made adding a parity bit difficult.)
2) At the time EDO was introduced, ECC was a pretty big step up in memory controller
complexity, so it was easier to just not include it.
3) Memory was very expensive when EDO was introduced (I paid $300 for 8 MB
at that point; of course, memory prices crashed right after that :( ).
4) Once motherboard and memory controller (north bridge) vendors got away without supporting
parity for a couple of years, everybody was cutting that corner to stay competitive.
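
To make point 1 concrete, here is a minimal Python sketch of per-byte even parity, the scheme that needs a ninth bit for every eight data bits (hence the problem with x8-only parts). The function names are made up for illustration and this is not how any particular memory controller implements it. It also shows the limit of plain parity: a single flipped bit is detected (not corrected), and two flipped bits go unnoticed.

```python
def even_parity_bit(byte: int) -> int:
    """Return the ninth bit that makes the count of 1s (data + parity) even."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected an 8-bit value")
    return bin(byte).count("1") & 1


def store(byte: int) -> tuple[int, int]:
    """Model a write: keep the 8 data bits plus the extra parity bit."""
    return byte, even_parity_bit(byte)


def check(byte: int, parity: int) -> bool:
    """Model a read: recompute parity and compare it with the stored bit."""
    return even_parity_bit(byte) == parity


data, parity = store(0b10110010)       # four 1s, so the parity bit is 0
one_flip = data ^ 0b00001000           # a single bit flipped in the data
two_flips = data ^ 0b00011000          # two bits flipped in the data

assert check(data, parity)             # clean read passes
assert not check(one_flip, parity)     # single-bit error is detected
assert check(two_flips, parity)        # double-bit error slips through
```

ECC (for example a SECDED code over a 64-bit word) spends more check bits to correct single-bit errors and detect double-bit ones, which is part of why it was a bigger step for the memory controller.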
ECC was first available in servers because customers demanded it.
I'm not sure, to this day, how well the various OSes support reporting
ECC corrections, or how proactive they are about isolating questionable memory.
I know the mainframes I worked on could hot-swap bad memory, which was
a big selling point (and required a lot of OS support). Of course, these days
you can just migrate the VM to another host, but you still need to be able to
flag when bad things are happening.
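
As one data point on the reporting side: Linux exposes corrected and uncorrected error counters through the EDAC subsystem in sysfs, when an EDAC driver for the memory controller is loaded. A rough Python sketch for reading them follows; the function name is made up, and it assumes the usual `/sys/devices/system/edac/mc/mcN/ce_count` and `ue_count` layout. It says nothing about how other OSes handle this, or about how proactive the kernel is beyond counting.

```python
from pathlib import Path


def read_edac_counts(root: str = "/sys/devices/system/edac/mc") -> dict:
    """Collect corrected (ce) and uncorrected (ue) error counts per controller.

    Returns an empty dict when the EDAC directory is absent, e.g. when there
    is no ECC memory or no EDAC driver loaded.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # counter missing or unreadable
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for controller, entry in read_edac_counts().items():
        print(controller, entry)
```

A nonzero and growing `ce_count` is the kind of "bad things are happening" flag worth alerting on before deciding to migrate VMs off the host.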