By: dmcq (dmcq.delete@this.fano.co.uk), March 4, 2021 6:16 am
Room: Moderated Discussions
anon2 (anon.delete@this.anon.com) on March 3, 2021 5:30 pm wrote:
> blaine (myname.delete@this.acm.org) on March 3, 2021 2:53 pm wrote:
> > In designing the HP(E) Superdome, where possible, we tried
> > to do end to end protection. There are some areas
> > where that is not possible (like when you use Intel processors, unless you use voting).
>
> What do you mean by this?
>
> I think by end-to-end, Linus just means that (for highly critical data) then the error correction metadata
> should be generated where the data is generated and stored where the data is stored and checked where
> the data is consumed. Not that all or any particular component along the way must have a given failure
> rate or particular error handling strategy. Although your end to end strategy obviously has to take
> into account the reliability of the components to target an overall error profile.
>
> As it pertains to the Linux pagecache -- it makes a lot of sense to add individual error
> improvement strategies in parts of hardware you know the characteristics of in order to
> achieve the desired error rates, it makes a lot of sense to have an application that is
> designed to run on a hardware stack with a particular error rate profile. It makes less
> sense for intermediate layers to "just add some more ECC for good measure, just in case".
I think the important thing is error detection - not recovery. Error recovery at a low level is nice to have, but if the whole business can be fixed at a higher level and the error rate is low enough, it is not really necessary. Intel leaving out ECC was dreadful, but the thing that I think was really criminal and cretinous was cutting out even parity checking. I see it as a cheap trick to obscure errors, so people just blamed gremlins and pressed ctrl-alt-delete rather than fixing underlying problems. Of course some memory problems would escape parity checking, but it would catch memory that is failing and it would give an indication of how reliable the memory is overall. Errors in the CPU should become apparent more quickly, but even there I think there should be a background test running every so often so the less used facilities are checked properly.
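To make the detection versus recovery point concrete, here is a minimal sketch of per-byte even parity in Python. It is only an illustration: the names `parity_bit`, `store` and `failing_addresses` are made up for this example and do not model any real memory controller. It shows that parity flags a single flipped bit without being able to repair it, and that a double flip in one byte escapes, which is the kind of problem that would get past it.

```python
def parity_bit(byte):
    """Return the even-parity bit: 1 if the byte has an odd number of set bits."""
    return bin(byte & 0xFF).count("1") & 1


def store(data):
    """Model a memory write: keep each byte together with its parity bit."""
    return [(b, parity_bit(b)) for b in data]


def failing_addresses(memory):
    """Model a scrub pass: list addresses whose stored parity no longer matches."""
    return [addr for addr, (b, p) in enumerate(memory) if parity_bit(b) != p]


memory = store(b"page cache")

# Flip one bit at address 3: parity detects it but cannot say which bit flipped.
memory[3] = (memory[3][0] ^ 0x10, memory[3][1])
print(failing_addresses(memory))  # [3]

# Flip two bits at address 5: the parity is unchanged, so this error escapes.
memory[5] = (memory[5][0] ^ 0x03, memory[5][1])
print(failing_addresses(memory))  # still [3]
```

Counting how often `failing_addresses` returns anything over time would give the rough indication of overall reliability mentioned above, even though nothing is corrected at this level.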
Of course, as Linus says, nothing can be perfect. I like the story I heard about a reliability course being held whilst the first shuttle was about to fly. They did a study of its clock system as part of the course and put the launch on the TV as a good ending. And then they heard the background talk as the launch was delayed. "That's the bit we were just talking about," they said. It was exactly the clock synchronization problem they had been discussing! ;-)