By: dmcq (dmcq.delete@this.fano.co.uk), March 4, 2021 7:40 am
Room: Moderated Discussions
Etienne Lorrain (etienne_lorrain.delete@this.yahoo.fr) on March 4, 2021 6:26 am wrote:
> dmcq (dmcq.delete@this.fano.co.uk) on March 4, 2021 5:16 am wrote:
> > ...
> >
> > I think the important thing is error detection - not recovery. Error recovery at a low level is nice to
> > have but if the whole business can be fixed at a higher level and the error rate is low enough it is not
> > really necessary. Intel leaving out ECC was dreadful, the thing that I think was really criminal and cretinous
> > though was cutting out even parity checking. I see it as a cheap trick to obscure errors so people just
> > blamed gremlins and pressed ctrl-alt-delete rather than fixing underlying problems. Of course some memory
> > problems would escape that but it would catch memory that is failing and it would give an indication of
> > how reliable it is overall.
>
> Historically, I think ECC error detection was removed approximately at the time it took too much time to
> initialise the memory. At power-up, the parity bit is not initialised: if you dump the DDR before initialisation
> you get mostly zero bits but you will also get bits set (I do not know why the capacitor is still charged
> at power-up). If you do a quick power-cycle, it is obvious you will still have bits set.
> When the memory of the PC increased to a few tens of megabytes, the CPU (at that time) was
> not able to clear that DRAM (so initialise the ECC) in less than 10 seconds, and the PC never
> had a powerful DMA to do such work. To cut boot time, they removed the parity bit.
>
> The problem of when to initialise ECC is still there, and on an embedded system I worked on,
> I was intentionally setting ECC errors on every ECC line at boot to be sure the O.S. (when present,
> or the bootloader) does not directly use uninitialised memory. Obviously you can only do that if
> you take control of the CPU just after reset; any "secure boot" stuff will not help.
>
> You need to ensure you initialise the ECC correctly: one common problem is that if
> you write less than a full ECC line, you may get an ECC error at write time.
>
> That is why IMHO the O.S. should itself handle any (non recoverable) ECC error: ignore it if it
> is new memory allocated to a process, correct it if it comes from a file-backed page, and stop
> the owning process(es) if necessary (i.e. cannot correct). And log the address of the error.
> Having "secure boot" and virtual boxes does not help, but giving them only
> memory blocks with invalid ECC would probably also detect bugs there.
Using ECC to detect uninitialised memory sounds interesting.
The initialisation problem could easily have been solved by initialising a page only when it is allocated, or by overwriting it when it is loaded from disk. You would still spend some seconds on the job overall, but the user wouldn't notice. I can't really believe they wouldn't have thought of that, never mind that booting took a long time anyway, so removing parity would have been no great gain.
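To make the idea concrete, here is a minimal sketch of initialise-on-allocation. It is only a toy simulation, not how any real memory controller or OS does it: the class `LazyParityMemory`, the `ParityError` exception, the 4 KiB `PAGE_SIZE` and the per-page `initialised` flag are all names and assumptions made up for illustration. The point is just that nothing is cleared at boot, and the check bits of a page only become valid when the page is handed out or filled from disk.

```python
PAGE_SIZE = 4096  # assumed page size in bytes, for illustration only


class ParityError(Exception):
    """Raised when a read touches memory whose check bits were never set up."""


class LazyParityMemory:
    """Toy model: check bits become valid per page, at allocation time."""

    def __init__(self, num_pages):
        # Power-up state: nothing is cleared, so every page counts as
        # having invalid check bits until it is allocated.
        self.data = bytearray(num_pages * PAGE_SIZE)
        self.initialised = [False] * num_pages
        self.pages_cleared = 0  # how many pages had to be zero-filled

    def allocate(self, page, from_disk=None):
        """Hand out a page, making its check bits valid on the way."""
        start = page * PAGE_SIZE
        if from_disk is not None:
            # Page is overwritten from disk anyway: no separate clear needed.
            if len(from_disk) != PAGE_SIZE:
                raise ValueError("disk block must be exactly one page")
            self.data[start:start + PAGE_SIZE] = from_disk
        else:
            # Fresh page: zero-fill, which also writes valid check bits.
            self.data[start:start + PAGE_SIZE] = bytes(PAGE_SIZE)
            self.pages_cleared += 1
        self.initialised[page] = True

    def read(self, addr):
        """Read one byte; fail if the page was never initialised."""
        if not self.initialised[addr // PAGE_SIZE]:
            raise ParityError(f"uninitialised memory at address {addr:#x}")
        return self.data[addr]


if __name__ == "__main__":
    mem = LazyParityMemory(num_pages=8)
    mem.allocate(2)                                  # fresh page
    mem.allocate(5, from_disk=b"\xab" * PAGE_SIZE)   # file-backed page
    print(mem.read(2 * PAGE_SIZE), hex(mem.read(5 * PAGE_SIZE)))
    try:
        mem.read(3 * PAGE_SIZE)                      # never allocated
    except ParityError as err:
        print("caught:", err)
```

The clearing cost is spread over allocations instead of being paid up front at boot, and a read of a page that was never handed out shows up as an error rather than as silent garbage.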