By: Chester (lamchester.delete@this.gmail.com), January 7, 2021 5:00 am
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on January 6, 2021 11:38 am wrote:
> ⚛ (0xe2.0x9a.0x9b.delete@this.gmail.com) on January 6, 2021 9:45 am wrote:
> > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on January 5, 2021 12:29 pm wrote:
> > > ECC is safer under normal circumstances,
> >
> > What is "normal circumstances"? From a mathematical viewpoint, ECC DDR4 modules can afford to be of lower
> > quality
>
> Bullshit.
>
> ECC safety isn't about the "correctable" part.
>
> Why don't people get that? The correction part of ECC is almost irrelevant.
>
> In fact, five lines later, you ask for the OS to do checksumming for DRAM problems, because you seem to realize
> that the only thing that really matters is reporting whether the memory you use is reliable or not.
>
> That is why you need ECC. Not for correction. For knowing whether your machine
> is reliable or not. Without ECC, you're basically screwed. You have no idea.
>
> (And yes, I've said it before, and I'll say it again: parity is almost as good as ECC. Exactly because
> parity does the important part - not as well, no, but certainly a lot better than nothing).
>
> And no, it's not the job of the OS to fix broken hardware. Doing checksums of disk contents
> is one thing (but honestly, the disks themselves had better have those checksums internally
> anyway, and they do), but doing "software ECC" is just you desperately trying to make
> excuses and make up an argument that is complete and utter garbage.
>
> And btw, don't talk to me about uncorrectable errors, or - worse yet - about undetectable
> three-bit flips, which is inevitably the next stage of denial. Do they happen? Sure. But
> the normal single-bit flips will happen before they do, and honestly, the whole argument
> of "but nothing is perfect" isn't an argument at all, it's just pure and utter stupidity.
>
> So stop the idiocy already.
>
> Linus
Except when we're talking about malicious attacks: researchers figured out that they could flip three bits in the same ECC word and cause an undetectable error. Why not tighten refresh timings until the attack no longer works?
For random bit flips, yeah, ECC will do the job. But I wonder how many of those random bit flips happened because of an unlucky access pattern (which would be fixed by refreshing more often).
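To make the three-bit case concrete, here is a toy sketch of a SECDED code (single-error-correct, double-error-detect). It assumes an extended Hamming(8,4) code, which is far smaller than the 72-bit words on real ECC DIMMs, and the function names (`encode`, `decode`, `flip`, `tally`) are made up for this illustration. It is not how any particular memory controller works. It only shows the general effect: one flip is corrected, two flips are reported, and three flips can look like a single correctable error, so the decoder "fixes" the word into the wrong data without raising an alarm.

```python
from itertools import combinations

def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    code = [0] * 8  # index 0: overall parity, indices 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):  # parity bits live at the power-of-two positions
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2  # makes the parity of the whole word even
    return code

def decode(code):
    """Return (status, nibble) the way a simple SECDED decoder would."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i
    odd_parity = sum(code) % 2 == 1
    fixed = list(code)
    if odd_parity:
        # Assumed to be a single-bit error; syndrome 0 points at bit 0.
        fixed[syndrome] ^= 1
        status = "corrected"
    elif syndrome:
        status = "uncorrectable"  # even parity, nonzero syndrome: 2 flips
    else:
        status = "ok"
    nibble = fixed[3] | fixed[5] << 1 | fixed[6] << 2 | fixed[7] << 3
    return status, nibble

def flip(code, positions):
    """Return a copy of the codeword with the given bit positions flipped."""
    out = list(code)
    for p in positions:
        out[p] ^= 1
    return out

def tally(data, nflips):
    """Count decoder outcomes over every possible nflips-bit error."""
    counts = {}
    word = encode(data)
    for positions in combinations(range(8), nflips):
        status, got = decode(flip(word, positions))
        if status == "uncorrectable":
            key = "detected"
        elif got == data:
            key = "fine"
        else:
            key = "silent corruption"
        counts[key] = counts.get(key, 0) + 1
    return counts

if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "flip(s):", tally(0b1011, n))
```

In this toy code every three-bit error is miscorrected. With the longer codes used on real modules, some three-bit patterns are still flagged, so an attacker has to find the patterns that slip through.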