By: zArchJon (Anon.delete@this.anon.com), March 6, 2021 9:39 pm
Room: Moderated Discussions
rwessel (rwessel.delete@this.yahoo.com) on March 5, 2021 12:32 pm wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on March 5, 2021 11:59 am wrote:
> > Michael S (already5chosen.delete@this.yahoo.com) on March 5, 2021 4:06 am wrote:
> > > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on March 4, 2021 8:47 pm wrote:
> > > >
> > > > All those MIS people want to know is which DIMM to replace, right?
> > > >
> > > > Linus
> > >
> > > Isn't it true not just for MIS (don't know what it means) but for me too, as a workstation* user?
> >
> > You do want to replace a known bad DIMM, yes.
> >
> > But it's not always about a known bad DIMM at all.
> >
> > So you want to do SO MUCH MORE than just that with ECC reporting.
> >
> > So let's take a very concrete example: "Ok, one machine in the farm of a few hundred machines inexplicably
> > hung, and was power-cycled by a hardware watchdog timer, and came back up, all happy again".
> >
> > And notice how I say "a few hundred machines". If you have a farm of millions of machines, you
> > probably end up having very special hardware to help log this, and honestly, you don't even care
> > about a single machine - you start caring only once you start seeing big patterns ("these machines
> > we bought in the same batch behave in a statistically significantly different way from the other ones").
> >
> > But if you're a regular company, with some random internal set of machines, you probably do
> > not have hugely specialized logging hardware outside of the trivial "system management ethernet
> > network with virtual consoles and serial lines". Which help for the obvious problems, but are
> > almost entirely useless for the "machine hung with no messages" kind of problems.
> >
> > Sure, you can continue to use that machine once it came back up, but what you really want to
> > do is to have some automated post-mortem, and so what you'd like to do is save off the old memory
> > image so that you have some idea of what happened. Was it the PCI bus that hung up because of
> > some glitch, was it software that deadlocked with interrupts disabled, what was going on?
> >
> > So what you'd actually want to do is read the old memory contents as you bring it up again. Perhaps
> > not all of it. Perhaps you just have a part of the memory set aside for some fast logging exactly so
> > that in these situations you can at least see "what was the last thing the machine was doing".
> >
> > In fact, you'd like your standard OS image to just do that for you automatically,
> > so that you're not actually even going to need hugely specialized knowledge.
> >
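(As an aside: Linux's pstore/ramoops mechanism is essentially this idea - kernel messages go into a reserved RAM region that is read back after a reboot. As a toy userspace sketch of the same trick - the physical address and magic number here are made up, and a real setup would reserve the region at boot and probably map it uncached - it might look like:

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define LOG_PHYS_BASE 0x80000000UL   /* hypothetical reserved region */
#define LOG_SIZE      0x10000UL      /* 64 KB ring */

struct ring {
	uint32_t magic;              /* recognize a log that survived reboot */
	uint32_t head;               /* next write offset */
	char     data[];
};

int log_msg(const char *msg)
{
	int fd = open("/dev/mem", O_RDWR | O_SYNC);
	if (fd < 0)
		return -1;
	struct ring *r = mmap(NULL, LOG_SIZE, PROT_READ | PROT_WRITE,
			      MAP_SHARED, fd, LOG_PHYS_BASE);
	close(fd);
	if (r == MAP_FAILED)
		return -1;
	if (r->magic != 0x4c4f4721) {        /* no surviving log: start fresh */
		r->magic = 0x4c4f4721;
		r->head = 0;
	}
	size_t room = LOG_SIZE - sizeof(*r); /* simple byte-wise ring buffer */
	for (const char *p = msg; *p; p++)
		r->data[r->head++ % room] = *p;
	r->data[r->head++ % room] = '\n';
	munmap(r, LOG_SIZE);
	return 0;
}

After the watchdog power-cycles the box, whatever is still intact in that region tells you the last thing the machine was doing.)
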
> > So you want to read all that memory, and it is going to give you ECC errors, because it hasn't
> > been scrubbed and you cut power to it for a second (because (a) that's easy and (b) it's sometimes
> > the only thing that really brings the machine back when the hardware is wedged).
> >
> > So you get all those ECC reports, and you're going to mostly ignore them, but
> > they still might contain useful data even for the non-scrubbed memory.
> >
> > But wouldn't it be good to know which ones to ignore, and which ones not to ignore? Because
> > some of the ECC reports might be from the memory that you have scrubbed and that is part of
> > the newly booted image. Maybe that's literally the reason things went sideways?
> >
> > This can be really hard on lots of hardware, because quite often the ECC error reporting is really really
> > badly done as I outlined. In this case, you don't care about which memory channel and DIMM and rank it
> > was - because that is actually an almost completely random mapping of the actual virtual/physical address
> > that the CPU instruction stream is using (because of interleaving both on a NUMA level and on a chip level,
> > because of address bit remapping by hardware both on the motherboard and inside the CPU chip etc etc).
> >
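(To make that "almost completely random mapping" concrete, a memory controller's address decode conceptually looks something like the toy function below. Every bit selection here is invented - real controllers hash different bits, the BIOS can reprogram the map, and NUMA interleave adds another layer on top - which is exactly why a "channel 1, DIMM 0, rank 2" report is so hard to turn back into the physical address the CPU was using:

#include <stdint.h>

struct dram_loc {
	unsigned channel, dimm, rank;
};

static struct dram_loc decode(uint64_t pa)
{
	struct dram_loc loc;

	/* channel picked by XOR-ing widely spaced address bits */
	loc.channel = (pa >> 6 ^ pa >> 14 ^ pa >> 27) & 1;
	/* DIMM select from a different hash of bits */
	loc.dimm    = (pa >> 17 ^ pa >> 30) & 1;
	/* rank from yet another one */
	loc.rank    = (pa >> 13 ^ pa >> 21) & 3;
	return loc;
}

Inverting decode() requires knowing every hash and every BIOS setting on that exact box - and the error report usually gives you only the output.)
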
> > And also, please take note of that "you want the standard OS image to do this all for you". This
> > needs to be standard and architected. Not some kind of "on this particular hardware it works
> > like X, and you look up the ECC details using these model-specific registers, and then translate
> > them using this other machine-specific register, and then do a third translation using the BIOS
> > tables described by this odd corner of the ACPI spec that nobody ever really tested and verified
> > since the spec is 2500 pages long and there are so many other odd corners".
> >
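(Linux's EDAC subsystem under /sys/devices/system/edac/ is the closest thing to that common layer today, but how much it can tell you still depends on a per-chipset driver existing and being loaded - which is the problem being described. A minimal sketch of polling its counters, assuming a single memory controller mc0:

#include <stdio.h>

static long read_count(const char *path)
{
	long n = -1;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%ld", &n) != 1)
			n = -1;
		fclose(f);
	}
	return n;
}

int main(void)
{
	/* corrected and uncorrected error totals for memory controller 0;
	   bigger boxes would loop over mc1, mc2, ... */
	long ce = read_count("/sys/devices/system/edac/mc/mc0/ce_count");
	long ue = read_count("/sys/devices/system/edac/mc/mc0/ue_count");

	printf("corrected: %ld, uncorrected: %ld\n", ce, ue);
	return 0;
}

Getting from those counts back to "which physical address, seen by which CPU" is where the model-specific register dance starts.)
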
> > See what I'm saying? ECC is just so much more than just "oh,
> > your DIMM is bad, please replace the DIMM that is in slot 5".
> >
> > It's about "maybe I intentionally didn't scrub part of the memory, and it would be lovely
> > to know how much of it actually I can trust even though it wasn't getting refreshed for
> > a second, but honestly, I'll happily take even corrupted data and look for patterns".
> >
> > It's about "maybe I'm under a rowhammer type attack, and I really want to know where
> > the accesses are coming from that cause these ECC bit flips, and what the CPU physical
> > and virtual addresses were - I don't care one whit which DRAM chip it was!".
> >
> > It's about "maybe it's not one DIMM that is going bad, maybe I have some odd high background
> > radiation levels that I didn't even think of, because it turns out I put my server
> > in a basement and it turns out it has exceptionally high radon levels".
> >
> > See?
> >
> > But in reality, ECC error reporting is a huge mess, and usually horribly badly done. It's almost
> > never architected, so it's a "this chip family does this". It's very seldom well-designed, so
> > it's often hard or impossible to do a reliable mapping of "I got an ECC error" to the source (usually
> > people do make sure that you can figure out at least which DIMM slot it is in, so that the "replace
> > this DIMM" at least works, but even that is often some black magic).
> >
> > And yes, I'll also very happily admit that systems like the Linux kernel don't necessarily do as good
> > of a job as we could do. Because 99% of developers have no access to the facilities in the first place,
> > can't really test it very well, and because it's not architected, it's not even "this is how you do
> > it on x86" - it's "on this family of Xeon CPU's with this setup, this is how you do it".
> >
> > Is it any wonder that I'm frustrated with it?
>
>
> Sure you don't want to come over to the dark side (Z*)? Machine check handling, including memory errors,
> has been architected since S/360. It'll even tell you it managed to roll back to before the failure.
>
> *We have cookies!
I am glad there are others who appreciate the benefits of the machine check architecture on IBM Z. It is amazing how many things the S/360 architects got right. There is also the TEST BLOCK instruction, which validates a 4 KB block of storage and clears up any ECC (or, in IBM Z language, "checking-block code") errors. That is useful if a machine check reported a failing storage address and you want to see whether it was a one-time problem or the memory has a hard error.
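
For the curious, driving TEST BLOCK from C on s390x would look roughly like the sketch below. This is an illustration based on my reading of the Principles of Operation, not production code: general register 0 must contain zero, the second operand designates the 4 KB block, and condition code 0 means the block was validated (cleared to zeros with valid checking-block codes), while condition code 1 means it is unusable.

static int test_block(unsigned long addr)
{
	int cc;

	asm volatile(
		"	lghi	0,0\n"          /* GR0 must contain zero */
		"	tb	0,%[addr]\n"    /* TEST BLOCK the 4 KB block */
		"	ipm	%[cc]\n"        /* extract the condition code */
		"	srl	%[cc],28\n"
		: [cc] "=d" (cc)
		: [addr] "a" (addr)
		: "0", "cc", "memory");

	return cc == 0;                         /* 1 = block usable again */
}

Run that over the block containing the failing address a machine check reported, re-test it a few times, and you can tell a one-time upset from a hard error.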