By: ⚛ (0xe2.0x9a.0x9b.delete@this.gmail.com), January 6, 2021 9:45 am
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on January 5, 2021 12:29 pm wrote:
> ECC is safer under normal circumstances,
What are "normal circumstances"? From a mathematical viewpoint, ECC DDR4 modules can afford to be of lower quality (and thus could perhaps be cheaper) than non-ECC DDR4 modules, simply because an ECC machine can correct a single bit flip and keep running, while the same bit flip on a non-ECC machine could lead to a crash or to silently corrupted data.
> but it also allows you to do more and live
> on the edge more, in other words. You can actually see when you're getting too close
> to the edge when the machine starts reporting a lot of correctable errors!
Given that something like 99% of non-server machines are non-ECC, maybe a question is in order: When is partial software ECC coming to the Linux kernel? At the very least, the kernel should checksum all pages that are read-only or mostly read-only, for example certain pages belonging to the filesystem cache. It would be reassuring for the user to know that the data of, for example, /lib64/libc-2.32.so, which is mapped read-only and cached in memory, still has the same checksum as the corresponding data on the storage device after the machine has been running without a reboot for a week.
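To make the idea a bit more concrete, here is a rough user-space sketch in Python. It is not kernel code and not an existing kernel feature. It assumes a 4096-byte page size and uses CRC32 as the checksum; the function names (page_checksums, corrupted_pages, flip_bit) are made up for illustration. Note that this only detects a flipped bit. To correct it, the page would have to be re-read from the storage device.

```python
import zlib
from typing import List

PAGE_SIZE = 4096  # assumption: 4 KiB pages; other page sizes exist


def page_checksums(data: bytes, page_size: int = PAGE_SIZE) -> List[int]:
    """Return one CRC32 per page-sized chunk of data."""
    return [
        zlib.crc32(data[off:off + page_size])
        for off in range(0, len(data), page_size)
    ]


def corrupted_pages(cached: bytes, recorded: List[int],
                    page_size: int = PAGE_SIZE) -> List[int]:
    """Return the indices of pages whose checksum no longer matches."""
    current = page_checksums(cached, page_size)
    return [i for i, (now, then) in enumerate(zip(current, recorded))
            if now != then]


def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Simulate a memory bit flip in a copy of data (hypothetical helper)."""
    buf = bytearray(data)
    buf[byte_index] ^= 1 << bit
    return bytes(buf)
```

The checksums would be recorded when a page is first read from storage, and a background scan could call something like corrupted_pages() later on. A mismatch on a clean read-only page means the cached copy can simply be dropped and read again from the HDD/SSD.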
(I haven't read the whole ECC discussion thread. My apologies if I am asking a question that has already been answered.)
-atom