By: Etienne Lorrain (etienne_lorrain.delete@this.yahoo.fr), March 4, 2021 7:58 am
Room: Moderated Discussions
dmcq (dmcq.delete@this.fano.co.uk) on March 4, 2021 6:40 am wrote:
> Etienne Lorrain (etienne_lorrain.delete@this.yahoo.fr) on March 4, 2021 6:26 am wrote:
> > dmcq (dmcq.delete@this.fano.co.uk) on March 4, 2021 5:16 am wrote:
> > > ...
> > >
> > > I think the important thing is error detection - not recovery. Error recovery at a low level is nice to
> > > have but if the whole business can be fixed at a higher level and the error rate is low enough it is not
> > > really necessary. Intel leaving out ECC was dreadful; the thing that I think was really criminal and cretinous
> > > though was cutting out even parity checking. I see it as a cheap trick to obscure errors so people just
> > > blamed gremlins and pressed ctrl-alt-delete rather than fixing underlying problems. Of course some memory
> > > problems would escape that but it would catch memory that is failing and it would give an indication of
> > > how reliable it is overall.
> >
> > Historically, I think ECC error detection was removed approximately at the time it took too much time to
> > initialise the memory. At power-up, the parity bit is not initialised: if you dump the DDR before initialisation
> > you get mostly zero bits but you will also get bits set (I do not know why the capacitor is still charged
> > at power-up). If you do a quick power-cycle, it is obvious you will still have bits set.
> > When the memory of the PC increased to a few tens of megabytes, the CPU (at that time) was
> > not able to clear that DRAM (so initialise the ECC) in less than 10 seconds, and the PC never
> > had a powerful DMA to do such work. To cut boot time, they removed the parity bit.
> >
> > The problem of when to initialise ECC is still there, and on some embedded system I worked on,
> > I was intentionally setting ECC errors on every ECC line at boot to be sure the O.S. (when present,
> > or the bootloader) does not directly use uninitialised memory. Obviously you can only do that if
> > you take control of the CPU just after reset, any "secure boot" stuff will not help.
> >
> > You need to ensure you initialise the ECC correctly; one common problem is that if
> > you write less than an ECC line, you may get an ECC error at write time.
> >
> > That is why IMHO the O.S. should itself handle any (non-recoverable) ECC error: ignore it if it
> > is new memory allocated to a process, correct it if it comes from a file-backed page, and stop
> > the owning process(es) if necessary (i.e. cannot correct). And log the address of the error.
> > Having "secure boot" and virtual boxes does not help, but handing out only
> > memory blocks with invalid ECC would probably also detect bugs there.
>
> Sounds interesting using ECC to detect uninitialised memory.
>
> The initialisation problem could have been easily solved by simply only initialising a page when
> it was allocated, or overwriting it if it is set from disk. You'd still get some seconds spent
> on the job but the user wouldn't notice. I can't really believe they wouldn't think of that never
> mind that booting took a long time anyway so this would be no sort of great gain.
>
>
"Only initialising a page when it was allocated" is not possible, because the BIOS does not allocate memory: it just tells you how much memory is available, and the concept of a page is not known at that point (no virtual memory). Moreover, the page size is decided by the OS, and processors support different sizes. Memory above 1 Megabyte was also special: EMS, XMS or HMA.
And you could not modify DOS: it was proprietary software with no source available.
The real missing thing was (and still partly is) a proper DMA, able to transfer more than 64 Kbytes at a time and able to access more than 1 Megabyte / 16 Mbytes, with chipset recognition and NDA docs.
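
To make the 64 Kbyte and 16 Mbyte limits concrete, here is a minimal C sketch. It assumes the classic PC/AT 8237-style controller on its 8-bit channels, where a 16-bit address counter is extended by an 8-bit page register: one transfer cannot cross a 64 Kbyte physical boundary and nothing at or above 16 Mbytes is reachable (on the original PC the page register was narrower, hence the 1 Mbyte figure). The names `dma_split` and `chunk_fn` are hypothetical, invented for this illustration; the code only computes how a range would have to be cut up and does not program any hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define DMA_BOUNDARY   0x10000u    /* 64 KiB: the address counter is 16 bits wide      */
#define DMA_ADDR_LIMIT 0x1000000u  /* 16 MiB: 16-bit counter plus 8-bit page register  */

/* Hypothetical callback, called once per chunk that fits in one DMA transfer. */
typedef void (*chunk_fn)(uint32_t phys, uint32_t len, void *ctx);

/*
 * Split the physical range [phys, phys + len) into chunks that never cross
 * a 64 KiB boundary. Returns the number of chunks, 0 for an empty range,
 * or -1 if any part of the range lies at or above the 16 MiB limit.
 */
int dma_split(uint32_t phys, uint32_t len, chunk_fn emit, void *ctx)
{
    if (len == 0)
        return 0;
    if (phys >= DMA_ADDR_LIMIT || len > DMA_ADDR_LIMIT - phys)
        return -1;  /* the controller cannot address this range at all */

    int chunks = 0;
    while (len > 0) {
        /* Bytes left before the next 64 KiB boundary. */
        uint32_t room = DMA_BOUNDARY - (phys & (DMA_BOUNDARY - 1));
        uint32_t n = len < room ? len : room;
        assert(n > 0);

        if (emit != NULL)
            emit(phys, n, ctx);

        phys += n;
        len -= n;
        chunks++;
    }
    return chunks;
}
```

For example, a 32-byte range starting at 0xFFF0 needs two transfers, and clearing 32 Mbytes this way would need at least 512 separately programmed transfers, half of which are not reachable at all under these assumptions. That is why such a controller was of little use for initialising parity or ECC over the whole memory.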