memory errors

By: rwessel (rwessel.delete@this.yahoo.com), March 5, 2021 1:32 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on March 5, 2021 11:59 am wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on March 5, 2021 4:06 am wrote:
> > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on March 4, 2021 8:47 pm wrote:
> > >
> > > All those MIS people want to know is which DIMM to replace, right?
> > >
> > > Linus
> >
> > Isn't it true not just for MIS (don't know what it means) but for me too, as workstation* user?
>
> You do want to replace a known bad DIMM, yes.
>
> But it's not always about a known bad DIMM at all.
>
> So you want to do SO MUCH MORE than just that with ECC reporting.
>
> So let's make a very concrete example: "Ok, one machine in the farm of a few hundred machines inexplicably
> hung, and was power-cycled by a hardware watchdog timer, and came back up, all happy again".
>
> And notice how I say "a few hundred machines". If you have a farm of millions of machines, you
> probably end up having very special hardware to help log this, and honestly, you don't even care
> about a single machine - you start caring only once you start seeing big patterns ("these machines
> we bought in the same batch have statistically significantly different behavior from other ones").
>
> But if you're a regular company, with some random internal set of machines, you probably do
> not have hugely specialized logging hardware outside of the trivial "system management ethernet
> network with virtual consoles and serial lines". Which help for the obvious problems, but are
> almost entirely useless for the "machine hung with no messages" kind of problems.
>
> Sure, you can continue to use that machine once it came back up, but what you really want to
> do is to have some automated post-mortem, and so what you'd like to do is save off the old memory
> image so that you have some idea of what happened. Was it the PCI bus that hung up because of
> some glitch, was it software that deadlocked with interrupts disabled, what was going on?
>
> So what you'd actually want to do is read the old memory contents as you bring it up again. Perhaps
> not all of it. Perhaps you just have a part of the memory set aside for some fast logging exactly so
> that in these situations you can at least see "what was the last thing the machine was doing".
>
> In fact, you'd like your standard OS image to just do that for you automatically,
> so that you're not actually even going to need hugely specialized knowledge.
>
> So you want to read all that memory, and it is going to give you ECC errors, because it hasn't
> been scrubbed and you cut power to it for a second (because (a) that's easy and (b) it's sometimes
> the only thing that really brings the machine back when the hardware is wedged).
>
> So you get all those ECC reports, and you're going to mostly ignore them, but
> they still might contain useful data even for the non-scrubbed memory.
>
> But wouldn't it be good to know which ones to ignore, and which ones not to ignore, because
> some of the ECC reports might be from the memory that you have scrubbed and is part of
> the newly booted image. Maybe that's literally the reason things went sideways?
>
> This can be really hard on lots of hardware, because quite often the ECC error reporting is really really
> badly done as I outlined. In this case, you don't care about which memory channel and DIMM and rank it
> was - because that is actually an almost completely random mapping of the actual virtual/physical address
> that the CPU instruction stream is using (because of interleaving both on a NUMA level and on a chip level,
> because of address bit remapping by hardware both on the motherboard and inside the CPU chip etc etc).
>
> And also, please take note of that "you want the standard OS image to do this all for you". This
> needs to be standard and architected. Not some kind of "on this particular hardware it works
> like X, and you look up the ECC details using these model-specific registers, and then translate
> them using this other machine-specific register, and then do a third translation using the BIOS
> tables described by this odd corner of the ACPI spec that nobody ever really tested and verified
> since the spec is 2500 pages long and there are so many other odd corners".
>
> See what I'm saying? ECC is just so much more than just "oh,
> your DIMM is bad, please replace the DIMM that is in slot 5".
>
> It's about "maybe I intentionally didn't scrub part of the memory, and it would be lovely
> to know how much of it actually I can trust even though it wasn't getting refreshed for
> a second, but honestly, I'll happily take even corrupted data and look for patterns".
>
> It's about "maybe I'm under a rowhammer type attack, and I really want to know where
> the accesses are coming from that cause these ECC bit flips, and what the CPU physical
> and virtual addresses were - I don't care one whit which DRAM chip it was!".
>
> It's about "maybe it's not one DIMM that is going bad, maybe I have some odd high background
> radiation levels that I didn't even think of, because it turns out I put my server
> in a basement and it turns out it has exceptionally high radon levels".
>
> See?
>
> But in reality, ECC error reporting is a huge mess, and usually horribly badly done. It's almost
> never architected, so it's a "this chip family does this". It's very seldom well-designed, so
> it's often hard to impossible to do certain mapping of "I got an ECC error" to the source (usually
> people do make sure that you can figure out at least which DIMM slot it is in, so that the "replace
> this DIMM" at least works, but even that is often some black magic).
>
> And yes, I'll also very happily admit that systems like the Linux kernel don't necessarily do as good
> of a job as we could do. Because 99% of developers have no access to the facilities in the first place,
> can't really test it very well, and because it's not architected, it's not even "this is how you do
> it on x86" - it's "on this family of Xeon CPU's with this setup, this is how you do it".
>
> Is it any wonder that I'm frustrated with it?


Sure you don't want to come over to the dark side (Z*)? Machine check handling, including memory errors, has been architected since S/360. It'll even tell you it managed to roll back to before the failure.

*We have cookies!
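
For what it's worth, the "which ECC reports to ignore" triage from the quoted post can be sketched in a few lines, provided the platform hands you a CPU physical address for each error, which is exactly the part Linus says is usually missing. Everything here is hypothetical and for illustration only: the `Range`, `Verdict` and `classify` names, and the idea that the boot image knows which physical ranges it preserved unscrubbed and which it scrubbed.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    EXPECTED = "expected"      # preserved, unscrubbed memory: decay after power loss is likely
    SUSPICIOUS = "suspicious"  # scrubbed memory used by the newly booted image
    UNKNOWN = "unknown"        # address falls outside every range we know about


@dataclass(frozen=True)
class Range:
    """Hypothetical physical address range: start inclusive, end exclusive."""
    start: int
    end: int

    def contains(self, addr: int) -> bool:
        return self.start <= addr < self.end


def classify(addr, preserved, scrubbed):
    """Classify one ECC report by the physical address it names.

    Assumes the hardware reports a CPU physical address rather than
    only a channel/DIMM/rank tuple.
    """
    if any(r.contains(addr) for r in preserved):
        return Verdict.EXPECTED
    if any(r.contains(addr) for r in scrubbed):
        return Verdict.SUSPICIOUS
    return Verdict.UNKNOWN
```

With a 1 MiB log area preserved at 0x10000000 and everything below it scrubbed, an error at 0x10000040 would come back as EXPECTED and one at 0x2000 as SUSPICIOUS. The second kind is the one worth chasing, since it may be why the machine went sideways in the first place.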