By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), March 4, 2021 9:47 pm
Room: Moderated Discussions
Jörn Engel (joern.delete@this.purestorage.com) on March 4, 2021 3:32 pm wrote:
>
> Or you could not bother clearing memory. BIOS/Bootloaders need to initialize however much they
> use, the kernel needs to keep a bit per page to indicate whether memory has been cleared before
> and lazily do the work as memory gets allocated. We did that in a past project. It has the
> added benefit that you can leave crash dumps in RAM and retrieve them after reboot.
Amen. Please, don't clear all the memory at boot just to scrub it. And make sure there's a mode where your ECC errors aren't fatal and don't cause any problems - just log them. That way you can at least try to read old memory (exactly for things like crash dumps, but also just logs etc), and if the ECC logging is competent, the system can then expect to get ECC errors and also tell that "oh, it's that memory that I know was maybe not fully initialized yet, because I'm trying to see if the previous kernel left some clues around".
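To make the quoted "bit per page" idea concrete, here is a minimal user-space sketch in C. It is not how any real kernel does it: the names (`memory`, `cleared_bitmap`, `alloc_page`, `free_page`, `peek_page`) and the tiny 64-page "RAM" are all hypothetical, chosen only to illustrate lazy clearing and why the old contents stay readable until a page is actually handed out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u
#define NUM_PAGES 64u

/* Hypothetical toy "physical memory" plus one cleared-bit per page. */
static uint8_t memory[NUM_PAGES * PAGE_SIZE];
static uint8_t cleared_bitmap[NUM_PAGES / 8];
static unsigned pages_zeroed; /* counts how often we actually paid for a memset */

static int page_is_cleared(unsigned pfn)
{
    return (cleared_bitmap[pfn / 8] >> (pfn % 8)) & 1u;
}

/* Read-only view of a page without clearing it (e.g. to look for an
 * old crash dump left behind by the previous boot). */
static const uint8_t *peek_page(unsigned pfn)
{
    return &memory[(size_t)pfn * PAGE_SIZE];
}

/* Hand out page 'pfn', zeroing it only if nobody has cleared it yet. */
static void *alloc_page(unsigned pfn)
{
    uint8_t *p = &memory[(size_t)pfn * PAGE_SIZE];

    if (!page_is_cleared(pfn)) {
        memset(p, 0, PAGE_SIZE);
        cleared_bitmap[pfn / 8] |= (uint8_t)(1u << (pfn % 8));
        pages_zeroed++;
    }
    return p;
}

/* A freed page may hold stale data again, so drop its cleared bit. */
static void free_page(unsigned pfn)
{
    cleared_bitmap[pfn / 8] &= (uint8_t)~(1u << (pfn % 8));
}
```

The point of the design is that the cost of clearing moves from boot time to first allocation, and anything nobody has allocated yet can still be inspected through something like `peek_page()`.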
Of course, "competent error logging" can be something of a mythical beast - a unicorn or similar. It tends not to be well architected (because hey, "ECC is special and not for everybody"), so the error logs you do get can be about things like "this channel and rank on this controller", and then it's not necessarily all that obvious what physical RAM address it actually was.
Because that mapping ends up being dependent on the exact hardware, but also on esoteric memory setup fields that were likely programmed by the firmware and are quite possibly not documented anywhere.
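As a rough illustration of why that mapping matters, here is a toy sketch in C. The layout is entirely made up: a hypothetical `struct dram_map` with a base address and a single channel-interleave field, standing in for whatever the firmware really programmed. Real memory controllers also interleave across ranks, banks and sockets, often with hashing, so treat this purely as a sketch under those assumptions.

```c
#include <stdint.h>

/* Hypothetical interleave setup, as firmware might have programmed it. */
struct dram_map {
    uint64_t base;          /* first physical address covered by this map */
    unsigned channel_shift; /* interleave granularity: 6 means 64-byte lines */
    unsigned channel_bits;  /* log2(number of channels) */
};

/* What a controller-centric error log might report. */
struct dram_loc {
    unsigned channel;
    uint64_t channel_offset; /* offset within that channel */
};

static struct dram_loc phys_to_dram(const struct dram_map *m, uint64_t phys)
{
    uint64_t off = phys - m->base;
    uint64_t low = off & ((1ull << m->channel_shift) - 1);
    uint64_t high = off >> (m->channel_shift + m->channel_bits);
    struct dram_loc loc;

    loc.channel = (unsigned)((off >> m->channel_shift) &
                             ((1ull << m->channel_bits) - 1));
    loc.channel_offset = (high << m->channel_shift) | low;
    return loc;
}

/* The inverse: only possible if you know the exact map that was in use. */
static uint64_t dram_to_phys(const struct dram_map *m, struct dram_loc loc)
{
    uint64_t low = loc.channel_offset & ((1ull << m->channel_shift) - 1);
    uint64_t high = loc.channel_offset >> m->channel_shift;
    uint64_t off = (high << (m->channel_shift + m->channel_bits)) |
                   ((uint64_t)loc.channel << m->channel_shift) | low;

    return m->base + off;
}
```

The same "channel plus offset" report decodes to a different physical address as soon as `channel_shift` changes, which is the problem with logs that only speak in controller terms when the setup fields aren't documented.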
And that's assuming you don't do the really insane things, and make ECC errors not recoverable in the first place, or hide them from the OS and relegate them to a special system management mode.
And/or make the machine check events be broadcast to every single CPU core in the system, because you decided that it's obviously way too much work and complexity to try to work back to the actual initiator of the memory access that resulted in you getting a bad memory line.
And yes, yes, some of this is very much because it is complicated and subtle. The ECC error might not be directly and clearly tied to a particular read, because the CPU was actually doing pre-fetching, and the error happened and was noticed when reading things opportunistically and speculatively.
So instead of setting a poison bit in the cacheline or something (so that you might get a better report later), you just raise a general machine check error and say "Hey, I got an ECC error on this bank, and no, I won't tell you what it was that triggered it, and I won't even tell you what the physical address was as far as the CPU core is concerned".
Because the thing that actually notices the ECC error isn't working in those CPU instruction address terms at all. It's thinking in terms of memory channels, and the mapping from one to the other was done by a different piece of logic entirely, and that logic doesn't want to have anything to do with any nasty ECC problems.
Sometimes you basically might need to be an MIS person to figure things out. Normal mortals be damned.
Of course, it's a self-fulfilling prophecy. You only have ECC in those machines that already have a cadre of MIS people to figure those things out, so why even try to make it more straightforward? All those MIS people want to know is which DIMM to replace, right?
Linus