By: Emanuel Rylke (ema.delete@this.mailbox.org), January 7, 2021 12:49 am
Room: Moderated Discussions
⚛ (0xe2.0x9a.0x9b.delete@this.gmail.com) on January 6, 2021 9:45 am wrote:
> Given that something like 99% of non-server machines are non-ECC, maybe a question is in order: When is
> partial software ECC coming to the Linux kernel? At the very least the kernel should checksum all pages
> that are read-only or mostly read-only, for example certain pages belonging to the filesystem cache. It would
> be calming to the user to know that the HDD/SSD data belonging to for example /lib64/libc-2.32.so which are
> mapped read-only and are cached in memory do have the same checksum as the checksum of the corresponding
> HDD/SSD data on the storage device if the machine has been running without a reboot for a week.
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on January 6, 2021 11:38 am wrote:
> And no, it's not the job of the OS to fix broken hardware.
What about doing it not as a workaround for broken hardware, but to make it easier to show that the hardware is broken? In theory I know that I'm probably getting bit errors and that's bad(TM), but if a cat /proc/bit_errors showed me that I had gotten at least 5 since boot, I would be much more motivated to do something about it.
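To make the idea concrete, here is a rough userspace sketch of detect-and-count, not kernel code. It records a CRC32 per page of a read-mostly buffer, then re-checks the pages later and reports which ones no longer match. PAGE_SIZE, PageScrubber and scrub() are names made up for this illustration, and /proc/bit_errors itself is hypothetical. Note that this only detects corruption at page granularity; it does not correct anything or count individual bits.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size for this sketch


class PageScrubber:
    """Hypothetical helper: checksum read-mostly pages, re-verify them later."""

    def __init__(self, memory: bytearray):
        self.memory = memory
        # Record one CRC32 per page when the data is first "cached".
        self.checksums = [zlib.crc32(page) for page in self._pages()]

    def _pages(self):
        # Yield zero-copy views of each page-sized chunk.
        view = memoryview(self.memory)
        for offset in range(0, len(view), PAGE_SIZE):
            yield view[offset:offset + PAGE_SIZE]

    def scrub(self) -> list[int]:
        """Return the indices of pages whose checksum no longer matches."""
        return [
            index
            for index, (page, crc) in enumerate(zip(self._pages(), self.checksums))
            if zlib.crc32(page) != crc
        ]


# Stand-in for cached read-only file data (four pages).
memory = bytearray(b"\x5a" * (4 * PAGE_SIZE))
scrubber = PageScrubber(memory)

clean = scrubber.scrub()          # nothing has changed yet

memory[PAGE_SIZE + 17] ^= 0x04    # simulate a single bit flip in page 1
flipped = scrubber.scrub()

print("mismatching pages before flip:", clean)
print("mismatching pages after flip:", flipped)
```

A running total of the mismatches found by a periodic scrub is roughly the number I would want to see exposed. Whether a check like this can be done cheaply enough inside the kernel is a separate question.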
Emanuel