memory errors

By: zArchJon (Anon.delete@this.anon.com), March 6, 2021 8:39 pm
Room: Moderated Discussions
rwessel (rwessel.delete@this.yahoo.com) on March 5, 2021 12:32 pm wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on March 5, 2021 11:59 am wrote:
> > Michael S (already5chosen.delete@this.yahoo.com) on March 5, 2021 4:06 am wrote:
> > > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on March 4, 2021 8:47 pm wrote:
> > > >
> > > > All those MIS people want to know is which DIMM to replace, right?
> > > >
> > > > Linus
> > >
> > > Isn't it true not just for MIS (don't know what it means) but for me too, as a workstation* user?
> >
> > You do want to replace a known bad DIMM, yes.
> >
> > But it's not always about a known bad DIMM at all.
> >
> > So you want to do SO MUCH MORE than just that with ECC reporting.
> >
> > So let's make a very concrete example: "Ok, one machine in the farm of a few hundred machines inexplicably
> > hung, and was power-cycled by a hardware watchdog timer, and came back up, all happy again".
> >
> > And notice how I say "a few hundred machines". If you have a farm of millions of machines, you
> > probably end up having very special hardware to help log this, and honestly, you don't even care
> > about a single machine - you start caring only once you start seeing big patterns ("these machines
> > we bought in the same batch have statistically significant behavior from other ones").
> >
> > But if you're a regular company, with some random internal set of machines, you probably do
> > not have hugely specialized logging hardware outside of the trivial "system management ethernet
> > network with virtual consoles and serial lines". Which help for the obvious problems, but are
> > almost entirely useless for the "machine hung with no messages" kind of problems.
> >
> > Sure, you can continue to use that machine once it came back up, but what you really want to
> > do is to have some automated post-mortem, and so what you'd like to do is save off the old memory
> > image so that you have some idea of what happened. Was it the PCI bus that hung up because of
> > some glitch, was it software that deadlocked with interrupts disabled, what was going on?
> >
> > So what you'd actually want to do is read the old memory contents as you bring it up again. Perhaps
> > not all of it. Perhaps you just have a part of the memory set aside for some fast logging exactly so
> > that in these situations you can at least see "what was the last thing the machine was doing".
> >
> > In fact, you'd like your standard OS image to just do that for you automatically,
> > so that you're not actually even going to need hugely specialized knowledge.
> >
> > So you want to read all that memory, and it is going to give you ECC errors, because it hasn't
> > been scrubbed and you cut power to it for a second (because (a) that's easy and (b) it's sometimes
> > the only thing that really brings the machine back when the hardware is wedged).
> >
> > So you get all those ECC reports, and you're going to mostly ignore them, but
> > they still might contain useful data even for the non-scrubbed memory.
> >
> > But wouldn't it be good to know which ones to ignore, and which ones not to ignore, because
> > some of the ECC reports might be from the memory that you have scrubbed and is part of
> > the newly booted image. Maybe that's literally the reason things went sideways?
> >
> > This can be really hard on lots of hardware, because quite often the ECC error reporting is really really
> > badly done as I outlined. In this case, you don't care about which memory channel and DIMM and rank it
> > was - because that is actually an almost completely random mapping of the actual virtual/physical address
> > that the CPU instruction stream is using (because of interleaving both on a NUMA level and on a chip level,
> > because of address bit remapping by hardware both on the motherboard and inside the CPU chip etc etc).
> >
> > And also, please take note of that "you want the standard OS image to do this all for you". This
> > needs to be standard and architected. Not some kind of "on this particular hardware it works
> > like X, and you look up the ECC details using these model-specific registers, and then translate
> > them using this other machine-specific register, and then do a third translation using the BIOS
> > tables described by this odd corner of the ACPI spec that nobody ever really tested and verified
> > since the spec is 2500 pages long and there are so many other odd corners".
> >
> > See what I'm saying? ECC is just so much more than just "oh,
> > your DIMM is bad, please replace the DIMM that is in slot 5".
> >
> > It's about "maybe I intentionally didn't scrub part of the memory, and it would be lovely
> > to know how much of it actually I can trust even though it wasn't getting refreshed for
> > a second, but honestly, I'll happily take even corrupted data and look for patterns".
> >
> > It's about "maybe I'm under a rowhammer type attack, and I really want to know where
> > the accesses are coming from that cause these ECC bit flips, and what the CPU physical
> > and virtual addresses were - I don't care one whit which DRAM chip it was!".
> >
> > It's about "maybe it's not one DIMM that is going bad, maybe I have some odd high background
> > radiation levels that I didn't even think of, because it turns out I put my server
> > in a basement and it turns out it has exceptionally high radon levels".
> >
> > See?
> >
> > But in reality, ECC error reporting is a huge mess, and usually horribly badly done. It's almost
> > never architected, so it's a "this chip family does this". It's very seldom well-designed, so
> > it's often hard to impossible to do certain mapping of "I got an ECC error" to the source (usually
> > people do make sure that you can figure out at least which DIMM slot it is in, so that the "replace
> > this DIMM" at least works, but even that is often some black magic).
> >
> > And yes, I'll also very happily admit that systems like the Linux kernel don't necessarily do as good
> > of a job as we could do. Because 99% of developers have no access to the facilities in the first place,
> > can't really test it very well, and because it's not architected, it's not even "this is how you do
> > it on x86" - it's "on this family of Xeon CPUs with this setup, this is how you do it".
> >
> > Is it any wonder that I'm frustrated with it?
>
>
> Sure you don't want to come over to the dark side (Z*)? Machine check handling, including memory errors,
> has been architected since S/360. It'll even tell you it managed to roll back to before the failure.
>
> *We have cookies!

I am glad there are others who appreciate the benefits of the machine check architecture on IBM Z. It is amazing how many things the S/360 architects got right. There is also the TEST BLOCK instruction, which validates a 4K page and clears any ECC (or, in IBM Z language, "checking-block code") errors. That is useful when a machine check has reported a failing storage address and you want to find out whether it was a one-time problem or the memory has a hard error.
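To make the "one-time problem or hard error" decision concrete, here is a rough Python sketch of the logic only. It is not z/Architecture code: TEST BLOCK is a privileged instruction, so the sketch takes a hypothetical test_block callback in its place. The names block_base, triage, Verdict and fake_test_block are all made up for illustration, and the condition-code convention (0 for a usable block, 1 for an unusable one) is my reading of the instruction and should be checked against the Principles of Operation.

```python
from enum import Enum
from typing import Callable

BLOCK_SIZE = 4096  # the 4K unit that gets tested


class Verdict(Enum):
    TRANSIENT = "transient"  # block tested usable again: likely a one-time problem
    HARD = "hard"            # block still unusable: treat as a hard error


def block_base(failing_address: int) -> int:
    """Round a failing storage address down to its 4K boundary."""
    return failing_address & ~(BLOCK_SIZE - 1)


def triage(failing_address: int, test_block: Callable[[int], int]) -> Verdict:
    """Classify a storage error reported by a machine check.

    test_block is a hypothetical stand-in for the real instruction: it takes
    a block address and returns a condition code (assumed 0 = usable,
    1 = not usable).
    """
    cc = test_block(block_base(failing_address))
    return Verdict.TRANSIENT if cc == 0 else Verdict.HARD


# Simulated hardware, for demonstration only: one block with a permanent fault.
HARD_FAULT_BLOCKS = {0x7000}


def fake_test_block(block_address: int) -> int:
    """Pretend condition code: 1 for the known-bad block, 0 otherwise."""
    return 1 if block_address in HARD_FAULT_BLOCKS else 0


if __name__ == "__main__":
    for addr in (0x7123, 0x8123):
        print(hex(addr), triage(addr, fake_test_block).value)
```

Keep in mind that the real instruction clears the block as part of the test, so anything worth saving from that page has to be dealt with before testing it.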