memory errors

By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), March 5, 2021 12:59 pm
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on March 5, 2021 4:06 am wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on March 4, 2021 8:47 pm wrote:
> >
> > All those MIS people want to know is which DIMM to replace, right?
> >
> > Linus
>
> Isn't it true not just for MIS (don't know what it means) but for me too, as workstation* user?

You do want to replace a known bad DIMM, yes.

But it's not always about a known bad DIMM at all.

So you want to do SO MUCH MORE than just that with ECC reporting.

So let's make a very concrete example: "Ok, one machine in the farm of a few hundred machines inexplicably hung, and was power-cycled by a hardware watchdog timer, and came back up, all happy again".

And notice how I say "a few hundred machines". If you have a farm of millions of machines, you probably end up having very special hardware to help log this, and honestly, you don't even care about a single machine - you start caring only once you start seeing big patterns ("these machines we bought in the same batch have statistically significantly different behavior from other ones").

But if you're a regular company, with some random internal set of machines, you probably do not have hugely specialized logging hardware outside of the trivial "system management ethernet network with virtual consoles and serial lines". Which help for the obvious problems, but are almost entirely useless for the "machine hung with no messages" kind of problems.

Sure, you can continue to use that machine once it came back up, but what you really want to do is to have some automated post-mortem, and so what you'd like to do is save off the old memory image so that you have some idea of what happened. Was it the PCI bus that hung up because of some glitch, was it software that deadlocked with interrupts disabled, what was going on?

So what you'd actually want to do is read the old memory contents as you bring it up again. Perhaps not all of it. Perhaps you just have a part of the memory set aside for some fast logging exactly so that in these situations you can at least see "what was the last thing the machine was doing".
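To make the "memory set aside for fast logging" idea concrete, here is a minimal sketch in Python, with a plain bytearray standing in for the reserved memory region. The slot layout, the MAGIC value and the function names are all made up for illustration and do not describe any real kernel facility. Each record carries its own checksum, so whatever reads the region after the power cycle can tell which records survived intact, and it still hands back the damaged ones instead of throwing them away.

```python
import struct
import zlib

# All of these constants are hypothetical; a real layout would differ.
MAGIC = 0x4C4F4721                 # marks a slot that holds a record
HEADER = struct.Struct("<IIHI")    # magic, sequence, payload length, crc32
SLOT_SIZE = 64                     # fixed-size slots keep recovery simple
MAX_PAYLOAD = SLOT_SIZE - HEADER.size


def write_record(region, slot, seq, payload):
    """Store one log record in the given slot of the preserved region."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload too large for one slot")
    off = slot * SLOT_SIZE
    region[off:off + SLOT_SIZE] = bytes(SLOT_SIZE)   # clear the old slot
    HEADER.pack_into(region, off, MAGIC, seq, len(payload), zlib.crc32(payload))
    start = off + HEADER.size
    region[start:start + len(payload)] = payload


def recover(region):
    """Return (seq, payload, intact) for every slot that looks like a record.

    Records whose checksum no longer matches are still returned, flagged
    with intact=False, because even corrupted data can show a pattern.
    """
    records = []
    for off in range(0, len(region) - SLOT_SIZE + 1, SLOT_SIZE):
        magic, seq, length, crc = HEADER.unpack_from(region, off)
        if magic != MAGIC or length > MAX_PAYLOAD:
            continue                                  # header itself is gone
        start = off + HEADER.size
        payload = bytes(region[start:start + length])
        records.append((seq, payload, zlib.crc32(payload) == crc))
    records.sort(key=lambda r: r[0])
    return records
```

The last record with a valid checksum is a reasonable first guess at "the last thing the machine was doing", and the flagged ones tell you how much of the region decayed while power was off.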

In fact, you'd like your standard OS image to just do that for you automatically, so that you're not actually even going to need hugely specialized knowledge.

So you want to read all that memory, and it is going to give you ECC errors, because it hasn't been scrubbed and you cut power to it for a second (because (a) that's easy and (b) it's sometimes the only thing that really brings the machine back when the hardware is wedged).

So you get all those ECC reports, and you're going to mostly ignore them, but they still might contain useful data even for the non-scrubbed memory.

But wouldn't it be good to know which ones to ignore, and which ones not to ignore? Because some of the ECC reports might be from the memory that you have scrubbed and is part of the newly booted image. Maybe that's literally the reason things went sideways?
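As a sketch of what that triage could look like, assume (and this is exactly the assumption that often fails on real hardware) that every ECC report comes with the CPU physical address, and that the boot code knows which physical ranges it scrubbed. The function names and the [start, end) range format below are hypothetical.

```python
from bisect import bisect_right


def build_index(scrubbed_ranges):
    """Sort and merge half-open [start, end) physical address ranges."""
    merged = []
    for start, end in sorted(scrubbed_ranges):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)   # overlaps or touches
        else:
            merged.append([start, end])
    return merged


def is_scrubbed(index, addr):
    """True if addr falls inside one of the merged scrubbed ranges."""
    starts = [r[0] for r in index]
    i = bisect_right(starts, addr) - 1
    return i >= 0 and addr < index[i][1]


def triage(report_addrs, scrubbed_ranges):
    """Split ECC report addresses into (investigate, expected).

    'investigate' holds errors in memory that was scrubbed for the new
    boot image, so they cannot be blamed on the power cycle. 'expected'
    holds errors in the old, unscrubbed memory that was read back on
    purpose.
    """
    index = build_index(scrubbed_ranges)
    investigate, expected = [], []
    for addr in report_addrs:
        (investigate if is_scrubbed(index, addr) else expected).append(addr)
    return investigate, expected
```

None of this is hard once the report carries a usable physical address. The hard part is getting that address out of the hardware in the first place.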

This can be really hard on lots of hardware, because quite often the ECC error reporting is really really badly done as I outlined. In this case, you don't care about which memory channel and DIMM and rank it was - because that is actually an almost completely random mapping of the actual virtual/physical address that the CPU instruction stream is using (because of interleaving both on a NUMA level and on a chip level, because of address bit remapping by hardware both on the motherboard and inside the CPU chip etc etc).
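To illustrate why the channel/DIMM/rank view tells you so little about the address, here is a toy interleaving function. The bit positions and the XOR hash are invented for this example and match no real memory controller; real mappings are model-specific and considerably more involved. Even with this simple made-up scheme, one contiguous 4 KiB page ends up spread across every channel, DIMM and rank.

```python
CACHE_LINE_SHIFT = 6   # 64-byte cache lines


def decode(phys_addr):
    """Toy mapping from physical address to (channel, dimm, rank).

    Hypothetical layout: 4 channels chosen by hashing low line bits with
    higher ones, then one bit each for rank and DIMM. Not real hardware.
    """
    line = phys_addr >> CACHE_LINE_SHIFT
    channel = (line ^ (line >> 7)) & 0x3
    rank = (line >> 2) & 0x1
    dimm = (line >> 3) & 0x1
    return channel, dimm, rank


def locations_for_page(page_base, page_size=4096):
    """Set of distinct (channel, dimm, rank) tuples one page touches."""
    step = 1 << CACHE_LINE_SHIFT
    return {decode(a) for a in range(page_base, page_base + page_size, step)}
```

With this toy mapping, adjacent cache lines land on different channels, and any page-aligned 4 KiB page touches all 16 combinations. So "the error was on channel 2, DIMM 1, rank 0" says almost nothing about which page, and therefore which data, was hit.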

And also, please take note of that "you want the standard OS image to do this all for you". This needs to be standard and architected. Not some kind of "on this particular hardware it works like X, and you look up the ECC details using these model-specific registers, and then translate them using this other machine-specific register, and then do a third translation using the BIOS tables described by this odd corner of the ACPI spec that nobody ever really tested and verified since the spec is 2500 pages long and there are so many other odd corners".

See what I'm saying? ECC is just so much more than just "oh, your DIMM is bad, please replace the DIMM that is in slot 5".

It's about "maybe I intentionally didn't scrub part of the memory, and it would be lovely to know how much of it I can actually trust even though it wasn't getting refreshed for a second, but honestly, I'll happily take even corrupted data and look for patterns".

It's about "maybe I'm under a rowhammer type attack, and I really want to know where the accesses are coming from that cause these ECC bit flips, and what the CPU physical and virtual addresses were - I don't care one whit which DRAM chip it was!".

It's about "maybe it's not one DIMM that is going bad, maybe I have some odd high background radiation levels that I didn't even think of, because it turns out I put my server in a basement and it turns out it has exceptionally high radon levels".

See?

But in reality, ECC error reporting is a huge mess, and usually horribly badly done. It's almost never architected, so it's a "this chip family does this". It's very seldom well-designed, so it's often hard or impossible to do a reliable mapping of "I got an ECC error" to the source (usually people do make sure that you can figure out at least which DIMM slot it is in, so that the "replace this DIMM" at least works, but even that is often some black magic).

And yes, I'll also very happily admit that systems like the Linux kernel don't necessarily do as good of a job as we could do. Because 99% of developers have no access to the facilities in the first place, can't really test it very well, and because it's not architected, it's not even "this is how you do it on x86" - it's "on this family of Xeon CPUs with this setup, this is how you do it".

Is it any wonder that I'm frustrated with it?

Linus