memory errors

By: dmcq, March 4, 2021 7:40 am
Etienne Lorrain on March 4, 2021 6:26 am wrote:
> dmcq on March 4, 2021 5:16 am wrote:
> > ...
> >
> > I think the important thing is error detection - not recovery. Error recovery at a low level is nice to
> > have but if the whole business can be fixed at a higher level and the error rate is low enough it is not
> > really necessary. Intel leaving out ECC was dreadful, the
> > thing that I think was really criminal and cretinous
> > though was cutting out even parity checking. I see it as a cheap trick to obscure errors so people just
> > blamed gremlins and pressed ctrl-alt-delete rather than fixing underlying problems. Of course some memory
> > problems would escape that but it would catch memory that is failing and it would give an indication of
> > how reliable it is overall.
> Historically, I think ECC error detection was removed approximately at the time it took too much time to
> initialise the memory. At power-up, the parity bit is not initialised: if you dump the DDR before initialisation
> you get mostly zero bits but you will also get bits set (I do not know why the capacitor is still charged
> at power-up). If you do a quick power-cycle, it is obvious you will still have bits set.
> When the memory of the PC increased to a few tens of megabytes, the CPU (at that time) was
> not able to clear that DRAM (so initialise the ECC) in less than 10 seconds, and the PC never
> had a powerful DMA to do such work. To cut boot time, they removed the parity bit.
> The problem of when to initialise ECC is still there, and on some embedded system I worked on,
> I was intentionally setting ECC errors on every ECC line at boot to be sure the O.S. (when present,
> or the bootloader) does not directly use uninitialised memory. Obviously you can only do that if
> you take control of the CPU just after reset, any "secure boot" stuff will not help.
> You need to ensure you initialise the ECC correctly, one usual problem is if
> you write less than an ECC line, you may get an ECC error at write time.
> That is why IMHO the O.S. should itself manage any ECC (non recoverable) error, ignore it if it
> is new memory allocated to a process, correct it if it comes from a file-backed page, and stop
> the owning process(es) if necessary (i.e. cannot correct). And log the address of the error.
> Having "secure boot" and virtual boxes does not help, but giving them only
> invalid ECC memory blocks would probably also detect bugs there.

Sounds interesting using ECC to detect uninitialised memory.
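
To make sure I follow the idea, here is a toy model of it. It is only a sketch: real ECC lives in the memory controller, and everything named here (`ToyEccMemory`, `LINE_SIZE`, `handle_ecc_error`, the page-kind labels) is made up for illustration. It shows every line being poisoned at boot, a read of an unwritten line faulting, a partial write faulting because it needs a read-modify-write, and the O.S. policy from the quoted post.

```python
# Toy model only: real ECC is done by the memory controller, not by software.
LINE_SIZE = 8  # bytes per ECC line (hypothetical value for illustration)


class EccError(Exception):
    """Raised when a line whose check bits are invalid is read."""


class ToyEccMemory:
    def __init__(self, n_lines):
        self.data = bytearray(n_lines * LINE_SIZE)
        # At "boot" every line is deliberately marked as having bad ECC.
        self.valid = [False] * n_lines

    def read_line(self, line):
        if not self.valid[line]:
            raise EccError(f"uncorrectable ECC error in line {line}")
        start = line * LINE_SIZE
        return bytes(self.data[start:start + LINE_SIZE])

    def write_line(self, line, payload):
        # A full-line write regenerates the check bits, so the line becomes valid.
        if len(payload) != LINE_SIZE:
            raise ValueError("write_line needs exactly one full ECC line")
        start = line * LINE_SIZE
        self.data[start:start + LINE_SIZE] = payload
        self.valid[line] = True

    def write_partial(self, line, offset, payload):
        # Read-modify-write: the old line has to be read first, so a
        # partial write to a poisoned line faults at write time.
        merged = bytearray(self.read_line(line))
        merged[offset:offset + len(payload)] = payload
        self.write_line(line, bytes(merged))


def handle_ecc_error(page_kind):
    """Policy from the quoted post; the page-kind labels are assumptions."""
    if page_kind == "fresh":        # newly allocated, nothing to lose
        return "ignore"
    if page_kind == "file_backed":  # a clean copy can be re-read from the file
        return "reload"
    return "stop_owner"             # cannot correct: stop the owning process
```

In this model any code that touches memory before writing a full line to it gets an `EccError`, which is the bug-catching property described above. In every case the address would also be logged.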

The initialisation problem could easily have been solved by initialising a page only when it was allocated, or by overwriting it when it was loaded from disk. You'd still spend some seconds on the job in total, but the user wouldn't notice. I can't really believe they wouldn't have thought of that, never mind that booting took a long time anyway, so skipping the initialisation would have been no great gain.
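
A rough sketch of what I mean by initialising on allocation, written as a simulation rather than real kernel code. The names (`LazyPagePool`, `PAGE_SIZE`, `alloc_zeroed`, `alloc_from_disk`) are hypothetical, and `None` just stands in for a page whose ECC has not been set up yet.

```python
PAGE_SIZE = 4096  # hypothetical page size for illustration


class LazyPagePool:
    def __init__(self, n_pages):
        # Nothing is cleared at boot, so start-up cost does not grow with RAM size.
        self.pages = [None] * n_pages  # None = ECC not yet initialised
        self.free = list(range(n_pages))
        self.bytes_cleared = 0

    def alloc_zeroed(self):
        """Hand out a fresh page, clearing it (and so its ECC) only now."""
        page = self.free.pop()
        self.pages[page] = bytearray(PAGE_SIZE)
        self.bytes_cleared += PAGE_SIZE
        return page

    def alloc_from_disk(self, contents):
        """Hand out a page that is fully overwritten by file data.

        The full overwrite initialises the ECC, so no clearing pass is needed.
        """
        if len(contents) != PAGE_SIZE:
            raise ValueError("contents must fill the whole page")
        page = self.free.pop()
        self.pages[page] = bytearray(contents)
        return page
```

The total clearing work is the same or less than clearing everything at boot, but it is spread over allocations, and pages that are never used are never cleared.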

