By: rwessel (rwessel.delete@this.yahoo.com), January 7, 2021 11:00 am
Room: Moderated Discussions
Terry Gray (cuyahogan.delete@this.aol.com) on January 7, 2021 9:47 am wrote:
> Jörn Engel (joern.delete@this.purestorage.com) on January 7, 2021 9:05 am wrote:
> > Emanuel Rylke (ema.delete@this.mailbox.org) on January 7, 2021 12:49 am wrote:
> > >
> > > What about doing it not as a workaround for broken hardware
> > > but to make it easier to show that the hardware
> > > is broken? In theory I know that I'm probably getting bit
> > > errors and that's bad(TM) but if a cat /proc/bit_errors
> > > showed me that I got at least 5 since boot I would be much more motivated to do something about it.
> >
> > Unrealistic for writable pages. Doable for read-only. You need a bit of shadow memory to store
> > the checksums and some fast hash function. Assuming you cannot use vector instructions, performance
> > would be 16 bytes per cycle or 256 cycles per page. You should calculate hashes when pages turn
> > read-only, again before they become writable and maybe periodically in between.
> >
> > Do you care enough to write a patch?
>
> Back in the 1960s Oregon State University had a CDC 3300 (24 bit computer).
>
> Some of the other students I shared an office with wrote an operating system for it called OS3
> (Oregon State Open Shop Operating System).
>
> It had parity memory, and to recover from errors in program code each sector had an exclusive OR
> of the contents as the last word in a sector. When an error occurred the word in error was known
> so they could calculate what that word should have been. So this idea is not new. But interesting
> that I have never heard of it being used anywhere else (although it may have been).
That's just RAID 4, if applied to disks.
And that's the basic idea behind RAIM or chipkill style systems.
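
To make the XOR trick concrete, here is a minimal sketch in Python. It is not the OS3 code, which I have not seen; the function names, the four-word sector, and the sample values are all made up for illustration. The only things taken from the description above are the 24-bit word size and the rule that the last word of a sector is the XOR of the others, with the parity hardware telling you which word went bad.

```python
from functools import reduce
from operator import xor

WORD_MASK = 0xFFFFFF  # 24-bit words, as on the CDC 3300


def seal_sector(data_words):
    """Return the sector with an XOR check word appended as the last word."""
    check = reduce(xor, data_words, 0) & WORD_MASK
    return list(data_words) + [check]


def recover_word(sector, bad_index):
    """Rebuild the word at bad_index by XORing every other word in the sector.

    Assumes exactly one word is bad and its position is already known
    (here, reported by the parity hardware).
    """
    others = (w for i, w in enumerate(sector) if i != bad_index)
    return reduce(xor, others, 0) & WORD_MASK


# Hypothetical four-word sector plus its check word.
sector = seal_sector([0x123456, 0xABCDEF, 0x000001, 0xFFFFFF])

damaged = list(sector)
damaged[1] = 0  # parity flags word 1 as bad; its contents are lost
repaired = recover_word(damaged, 1)
```

The same function also rebuilds the check word itself if that is the one that fails, which is why RAID 4 can lose any single disk, parity included. Note that XOR alone cannot locate the error; it only works because something else (per-word parity there, a failed-disk indication in RAID) says where it is.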