memory errors

By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), March 4, 2021 8:47 pm
Room: Moderated Discussions
Jörn Engel (joern.delete@this.purestorage.com) on March 4, 2021 3:32 pm wrote:
>
> Or you could not bother clearing memory. BIOS/Bootloaders need to initialize however much they
> use, the kernel needs to keep a bit per page to indicate whether memory has been cleared before
> and lazily do the work as memory gets allocated. We did that in a past project. It has the
> added benefit that you can leave crash dumps in RAM and retrieve them after reboot.

Amen. Please, don't clear all the memory at boot just to scrub it. And make sure there's a mode where your ECC errors aren't fatal and don't cause any problems - just log them. That way you can at least try to read old memory (exactly for things like crash dumps, but also just logs etc.) and if the ECC logging is competent, the system can then expect to get ECC errors and also tell that "oh, it's that memory that I know was maybe not fully initialized yet, because I'm trying to see if the previous kernel left some clues around".
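
To make the lazy-clearing idea from the quoted post concrete, here is a minimal user-space sketch in Python. It is a toy model, not kernel code: the class name `LazyZeroMemory`, the 4096-byte page size, and the first-fit allocator are all assumptions made for illustration. It keeps one flag per page, zeroes a page only when it is first handed out, and lets you read pages that have not been cleared yet.

```python
PAGE_SIZE = 4096  # assumed page size for this toy model


class LazyZeroMemory:
    """Toy model of RAM that is never scrubbed at boot.

    One flag per page records whether the page has been cleared since
    boot. Pages are zeroed only when they are first allocated.
    """

    def __init__(self, ram: bytearray):
        assert len(ram) % PAGE_SIZE == 0
        self.ram = ram  # contents left over from the "previous boot"
        self.num_pages = len(ram) // PAGE_SIZE
        self.cleared = [False] * self.num_pages    # the bit per page
        self.allocated = [False] * self.num_pages

    def read_stale(self, page: int) -> bytes:
        """Read a page that has not been cleared yet (e.g. an old crash dump)."""
        if self.cleared[page]:
            raise ValueError("page already cleared; old contents are gone")
        start = page * PAGE_SIZE
        return bytes(self.ram[start:start + PAGE_SIZE])

    def alloc_page(self) -> int:
        """Hand out the first free page, zeroing it if nobody has yet."""
        for page in range(self.num_pages):
            if not self.allocated[page]:
                if not self.cleared[page]:
                    start = page * PAGE_SIZE
                    self.ram[start:start + PAGE_SIZE] = bytes(PAGE_SIZE)
                    self.cleared[page] = True
                self.allocated[page] = True
                return page
        raise MemoryError("out of pages")
```

In this model a page that was never allocated still holds whatever the previous boot left there, which is the property that makes crash dump retrieval possible. A real implementation would also have to cope with ECC errors on such reads, which the sketch ignores.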

Of course, "competent error logging" can be something of a mythical beast - a unicorn or similar. It tends not to be well architected (because hey, "ECC is special and not for everybody"), so the error logs you do get can be about things like "this channel and rank on this controller", and then it's not necessarily all that obvious what actual physical RAM address it actually was.

Because that mapping ends up being dependent on the exact hardware, but also on esoteric memory setup fields that were likely programmed by the firmware and are quite possibly not documented anywhere.
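
A small sketch of why that mapping is so hard to reverse without the setup fields. The interleave scheme here is entirely hypothetical (`InterleaveConfig`, `decode`, and `encode` are made-up names, and real controllers use more complicated, often hashed, layouts). The point is only that the same "channel, rank, offset" report corresponds to different physical addresses depending on configuration you may not be able to see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterleaveConfig:
    """Hypothetical interleave setup, as firmware might program it."""
    line_bits: int = 6     # 64-byte cache lines
    channel_bits: int = 1  # 2 channels, interleaved per cache line
    rank_bits: int = 1     # 2 ranks per channel


def decode(addr: int, cfg: InterleaveConfig) -> tuple[int, int, int]:
    """Physical address -> (channel, rank, line offset within the rank)."""
    line = addr >> cfg.line_bits
    channel = line & ((1 << cfg.channel_bits) - 1)
    line >>= cfg.channel_bits
    rank = line & ((1 << cfg.rank_bits) - 1)
    offset = line >> cfg.rank_bits
    return channel, rank, offset


def encode(channel: int, rank: int, offset: int, cfg: InterleaveConfig) -> int:
    """(channel, rank, offset) -> physical address of the cache line."""
    line = (((offset << cfg.rank_bits) | rank) << cfg.channel_bits) | channel
    return line << cfg.line_bits


# The same error report under two different firmware setups:
two_channels = InterleaveConfig(channel_bits=1)
four_channels = InterleaveConfig(channel_bits=2)
addr_a = encode(1, 0, 5, two_channels)
addr_b = encode(1, 0, 5, four_channels)
print(hex(addr_a), hex(addr_b))  # two different physical addresses
```

Without knowing which configuration was programmed, an error log that says only "channel 1, rank 0" cannot be turned back into a physical address.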

And that's assuming you don't do the really insane things, and make ECC errors not recoverable in the first place, or hide them from the OS and relegate them to a special system management mode.

And/or make the machine check events be broadcast to every single CPU core in the system, because you decided that it's obviously way too much work and complexity to try to work back to the actual initiator of the memory access that then resulted in you getting a bad memory line.

And yes, yes, some of this is very much because it is complicated and subtle. The ECC error might not be directly and clearly tied to a particular read, because the CPU was actually doing pre-fetching, and the error happened and was noticed when reading things opportunistically and speculatively.

So instead of setting a poison bit in the cacheline or something (so that you might get a better report later), you just raise a general machine check error and say "Hey, I got an ECC error on this bank, and no, I won't tell you what it was that triggered it, and I won't even tell you what the physical address was as far as the CPU core is concerned".

Because the thing that actually notices the ECC error isn't actually working in those CPU instruction address terms at all. It's thinking in terms of memory channels, and the mapping from one to the other has happened by a different piece of logic entirely, and that logic doesn't want to have anything to do with any nasty ECC problems.
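
The poison-bit alternative mentioned above can be sketched as a toy cache model. Everything here is hypothetical (`Cache`, `PoisonedLineError`, the `consumer` label) and it does not describe any particular CPU. The idea it illustrates: a speculative fill that hits bad memory just marks the line as poisoned, and the error is only reported when something actually consumes the line, at which point both the physical address and the consumer are known.

```python
class PoisonedLineError(Exception):
    """Raised when a demand load consumes a poisoned cache line."""

    def __init__(self, phys_addr: int, consumer: str):
        super().__init__(f"poisoned line at {phys_addr:#x} consumed by {consumer}")
        self.phys_addr = phys_addr
        self.consumer = consumer


class Cache:
    LINE = 64  # assumed cache line size in bytes

    def __init__(self, bad_lines):
        # Line addresses whose DRAM read would fail ECC (test fixture).
        self.bad_lines = set(bad_lines)
        self.filled = set()
        self.poisoned = set()

    def _fill(self, line_addr: int) -> None:
        self.filled.add(line_addr)
        if line_addr in self.bad_lines:
            # Defer the report: remember the failure with the line.
            self.poisoned.add(line_addr)

    def prefetch(self, phys_addr: int) -> None:
        """Speculative fill: never raises, even on an ECC failure."""
        self._fill(phys_addr & ~(self.LINE - 1))

    def load(self, phys_addr: int, consumer: str) -> None:
        """Demand load: reports the error with address and consumer."""
        line_addr = phys_addr & ~(self.LINE - 1)
        if line_addr not in self.filled:
            self._fill(line_addr)
        if line_addr in self.poisoned:
            raise PoisonedLineError(line_addr, consumer)


cache = Cache(bad_lines={0x1000})
cache.prefetch(0x1010)            # hits the bad line, but stays silent
try:
    cache.load(0x1020, "cpu3")    # the actual use is where it gets reported
except PoisonedLineError as err:
    print(err)
```

Contrast this with the broadcast approach: here the error lands on whoever touched the data, with an address attached, and a prefetch that nobody ends up using never bothers anyone.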

Sometimes you basically might need to be a MIS person to figure things out. Normal mortals be damned.

Of course, it's a self-fulfilling prophecy. You only have ECC in those machines that already have a cadre of MIS people to figure those things out, so why even try to make it more straightforward? All those MIS people want to know is which DIMM to replace, right?

Linus