By: Maynard Handley (name99.delete@this.name99.org), December 17, 2020 5:55 pm
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on December 17, 2020 12:04 pm wrote:
> Maynard Handley (name99.delete@this.name99.org) on December 17, 2020 9:42 am wrote:
> > Etienne Lorrain (etienne_lorrain.delete@this.yahoo.fr) on December 17, 2020 9:26 am wrote:
> > > Maynard Handley (name99.delete@this.name99.org) on December 17, 2020 9:09 am wrote:
> > > > Björn Ragnar Björnsson (bjorn.ragnar.delete@this.gmail.com) on December 16, 2020 10:09 pm wrote:
> > > > > Adrian (a.delete@this.acm.org) on December 16, 2020 7:31 am wrote:
> > > > > > Gabriele Svelto (gabriele.svelto.delete@this.gmail.com) on December 15, 2020 3:24 pm wrote:
> > > > > > > It seems like Intel has added support for what they call in-band ECC to their recent Atom SoCs, see the
> > > > > > > mention here as well as here. There isn't much in the way of details on Intel pages apart from the
> > > > > > > fact that the mechanism can correct single-bit errors in
> > > > > > > non-ECC memory (presumably by reducing its effective
> > > > > > > size). However a Google search turned out this patent. All-in-all a very welcome development.
> > > > > >
> > > > > >
> > > > > > The support for In-Band ECC also exists in Tiger Lake U, but it is disabled in almost all SKUs,
> > > > > > including in most of the "Embedded" SKUs, where I would have expected it to be enabled.
> > > > > >
> > > > > > It is enabled only in the Tiger Lake U "Embedded" SKUs for the Extended Temperature Range.
> > > > > >
> > > > > > In-Band ECC allows the use of ECC with the LPDDR4x memories, but this advantage is paid
> > > > > > by a slight reduction in memory capacity and by a reduction in speed that is difficult to
> > > > > > quantify, because in most cases the extra accesses for ECC can be cached (the worst but
> > > > > > very seldom case is to do twice as many memory accesses, both for data and for ECC).
> > > > > >
> > > > > >
> > > > > > Intel has a patent application for it
> > > > >
> > > > > The padawans here may not know that back in the day, even after they were born, it was
> > > > > as simple as pie to order simms and dimms with parity/ecc at little additional cost.
> > > > >
> > > > > Sure, if you didn't care about ECC you could shave off a few percentage points on
> > > > > the price of memory. Big deal, but not for everybody, I'd always go for the slightly
> > > > > more expensive option (at least I could use it in system that supported it).
> > > > >
> > > > > Then something happened which changed the DRAM landscape forever (hopefully not forever though).
> > > > > What happened? Intel started producing processors that had no, absolutely no, nada, ability to implement
> > > > > ECC. These CPUs have for the last two decades been by far the bulk of processors sold. Could you
> > > > > have undetected memory errors? Yes, it's a near certainty. Could these have had serious consequences?
> > > > > Hard to say, are you working on anything that could have serious consequences?
> > > > >
> > > > > At a guess, an ECC capable DIMM costs 5-9% more to get to an end user than one that is
> > > > > without ECC. I'm willing to pay more than that for such a "fancy" feature, but oh-no, it's
> > > > > nearly impossible to source un-buffered ECC ram at reasonable speeds and/or prices.
> > > > >
> > > > > This in-line ECC appears to be a colossal kludge, presented as
> > > > > a feature, to solve a problem that never should have existed.
> > > > >
> > > > > > https://www.freepatentsonline.com/y2019/0332469.html
> > > > > >
> > > > > > which should have been rejected, because it describes good methods to implement
> > > > > > In-Band ECC, but which are completely obvious and are exactly like anyone would
> > > > > > implement it if given the task, without any other prior knowledge.
> > > > >
> > > > >
> > > >
> > > > From the startup log on an M1 boot...
> > > > 10.281418 AppleFireStormErrorHandler AppleARM64ErrorHandler: will not panic on correctible ECC errors
> > > >
> > >
> > > That is nice not to panic on correctible ECC error (instead
> > > of panic-ing, just correct the error and log the address).
> > > Obviously a correctible ECC error on a protected memory area (i.e. even the OS
> > > cannot read) would need to panic if the correction is not done in hardware.
> > > Now the question is also, do you only correct the read value (risking un-correctible error
> > > if another bit error appears on that address), or do you write-back the correction?
> > > Next step is not to panic on uncorrectible ECC error, just reload if possible,
> > > or kill only the task affected (or only the virtual machine affected).
> >
> > It all remains unclear.
> > One possibility is that this refers PURELY to ECC in caches (so not especially interesting).
> >
> > Another is that there's the possibility for in-line ECC but this is not yet hooked up?
> >
> > Another is that there is genuine in-line ECC, working exactly as you would
> > hope (and presumably with RAS functionality being added over time, eg even
> > a non-correctable error in a block of memory that also exists on disk).
> >
> > That's all even apart from the issue of how the OS intervenes. Hopefully
> > over the next year or so people will figure out more details.
>
> LPDDR doesn't have ECC, so I am skeptical that Apple uses ECC memory.
>
> I think your interpretation that it's ECC on caches is more likely to be correct.
>
> I don't think Apple has the volume to develop non-standard LPDDR memory interfaces
> and modules. But I could be wrong, and it would be awesome if they did.
>
> David
As I keep trying to remind you, with plentiful transistors many clever possibilities become available...
Even with traditional 8-bit DRAM at least two options present themselves:
(a) store the ECC in a reserved section of the DRAM. Sure, this presents bandwidth conflicts, but use of an ECC cache in the memory controller should limit the damage.
This is the obvious solution, and likely what Intel are doing.
(b) use memory compression. For example, compress each 128-byte line down to ~63 bytes + an ECC byte, and store that in RAM. Qualcomm implemented this exact scenario on Falkor. Obviously, exactly as I've described it, this gives you probabilistic ECC: some lines covered, others not. If you're willing to also implement a less aggressive compression, you can probably fit most of the remaining lines into 127 (or 126) bytes + an ECC byte or two.
At which point, sure, it's not perfect, only probabilistic, and you're not going to sell it as a z/ replacement. But it does give your home system a nice little RAS boost, plus, hopefully some early warning that a DRAM chip is going bad.
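For what it's worth, here is a rough Python sketch of the two options above, not a claim about what Intel, Qualcomm, or Apple actually do. The assumptions are mine and do not come from any vendor documentation: for (a), 64-byte data lines with 8 check bytes each, so eight data lines share one ECC line in a reserved region placed right after the usable memory, plus a small LRU cache of ECC lines in the controller; for (b), zlib standing in for whatever simple fixed-latency compressor real hardware would use. The names `ecc_location`, `dram_reads`, `slot_for` and `sample_lines` are made up for illustration. The check bytes themselves are not computed; the sketch only covers the address mapping, the extra-traffic accounting, and the slot-size decision.

```python
import random
import zlib
from collections import OrderedDict

# --- Option (a): ECC kept in a reserved region of the same DRAM ---
LINE = 64                                   # assumed data line size in bytes
ECC_PER_LINE = 8                            # assumed check bytes per data line
LINES_PER_ECC_LINE = LINE // ECC_PER_LINE   # 8 data lines share one ECC line


def ecc_location(data_addr, usable_bytes):
    """Map a data address to (ECC line address, byte offset in that line).

    Assumes the reserved ECC region starts right after the usable memory.
    """
    line_index = data_addr // LINE
    ecc_line = usable_bytes + (line_index // LINES_PER_ECC_LINE) * LINE
    offset = (line_index % LINES_PER_ECC_LINE) * ECC_PER_LINE
    return ecc_line, offset


def dram_reads(addresses, usable_bytes, cache_lines=4):
    """Count DRAM reads for a stream of data reads, given a small LRU ECC cache."""
    cache = OrderedDict()
    reads = 0
    for addr in addresses:
        reads += 1                          # the data line itself
        ecc_line, _ = ecc_location(addr, usable_bytes)
        if ecc_line in cache:
            cache.move_to_end(ecc_line)     # hit: no extra DRAM access
        else:
            reads += 1                      # miss: fetch the ECC line too
            cache[ecc_line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)   # evict least recently used
    return reads


# --- Option (b): compress the line to make room for check bytes ---
def slot_for(line):
    """Say which slot a 128-byte line would fit in once compressed."""
    if len(line) != 128:
        raise ValueError("expected a 128-byte line")
    size = len(zlib.compress(line, 9))      # zlib is only a stand-in compressor
    if size <= 63:
        return "half"                       # 63 bytes + 1 check byte in 64 bytes
    if size <= 126:
        return "full"                       # 126 bytes + 2 check bytes in 128 bytes
    return "unprotected"                    # stored raw, no room for check bytes


def sample_lines():
    """Two extreme test lines: all zeros, and seeded pseudo-random noise."""
    rng = random.Random(0)
    noise = bytes(rng.getrandbits(8) for _ in range(128))
    return {"zeros": bytes(128), "noise": noise}


if __name__ == "__main__":
    usable = 1 << 20
    print(dram_reads([i * 64 for i in range(64)], usable))    # sequential reads
    print(dram_reads([i * 512 for i in range(64)], usable))   # new ECC line each time
    for name, line in sample_lines().items():
        print(name, slot_for(line))
```

Under these assumptions, reading 64 consecutive lines costs 72 DRAM reads (64 data + 8 ECC), while a 512-byte stride that touches a new ECC line on every access costs 128, i.e. double. An all-zero line lands in the "half" slot and a line of random noise ends up "unprotected", which is exactly the some-lines-covered, some-not behaviour of (b).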
(Admittedly, with M1, your options as to what to do about this are limited.
Certainly you'll be able to get a replacement if you're within warranty, which is nice; maybe Apple will also provide a rough chipkill that will have the OS in future just avoid the bad chip? Which, if the machine is 5 years old and retired to non-frontline duty, is, what the heck, probably good enough for many purposes, though I fully expect plenty of people to complain...)
I honestly don't know where Apple stands on the range of possibilities from "ECC is purely a feature of our on-SoC caches" to "yeah, it's on our radar, one day we'll hook it up to the SoC hardware" to "we have working memory compression+ECC right now, bitches; you just haven't noticed yet".