By: Brendan (btrotter.delete@this.gmail.com), July 14, 2021 3:55 pm
Room: Moderated Discussions
Hi,
dmcq (dmcq.delete@this.fano.co.uk) on July 14, 2021 4:27 pm wrote:
> Brendan (btrotter.delete@this.gmail.com) on July 14, 2021 12:50 pm wrote:
> > dmcq (dmcq.delete@this.fano.co.uk) on July 14, 2021 10:13 am wrote:
> > > Actually I did write an operating system myself from scratch complete with multiple virtual
> > > machines about thirty years ago. The most difficult error handling was dealing with memory
> > > errors as we expected a couple a week to occur in the various boards using it.
> >
> > I apologize for completely changing the topic; but I've been researching/thinking about/"theorizing
> > about" memory error tolerance (on and off) for about 10 years now - mostly revolving around
> > the obvious "software ECC on top of paging" approach (as described in papers like this one:
> > https://repository.lib.ncsu.edu/bitstream/handle/1840.20/33537/TR-2016-7.pdf ).
> >
> > Are there any interesting insights and/or details you'd be willing to share regarding your experiences?
> >
> > - Brendan
>
> Nothing particularly wonderful. The memory had parity but not hamming - the problem would have been much less
> worrying with hamming but that's what saving money does to you. Tasks for the board could be discarded and restarted
> easily if it said there was an error which made things easier. On the other hand if it didn't recover in a halfway
> decent manner from a failure the host would be unable to restart the board after timing it out.
>
> As to the handling, chunks of read only areas were covered by longitudinal sums and were checked incrementally
> in the watchdog timer and fixed if there was only one error per chunk. Errors could also be
> detected whilst running or in IO. Read only errors would be fixed. If an error occurred in an unused
> area including parts of buffers which weren't full the error was just fixed. Otherwise errors in writable
> areas which could be ascribed to a particular VM caused an error to be returned for the task, otherwise
> the board said there was a general error and everything was restarted. The handler for the errors
> had a little bit duplicated and provided one got past an initial little part an error in either half
> would be handled by going down the other path. Stuck errors were handled using virtual memory. Details
> would be returned about where errors happened rather like a disk.
Thanks - it definitely sounds like an interesting piece of engineering (and made me think more about providing feedback higher levels could use to further manage uncorrected errors). :-)
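For anyone following along, here is a rough toy model of how I read the "parity plus longitudinal sums" part for read only chunks. It is only a sketch under my own assumptions: the hardware parity bit is simulated with a list, the longitudinal sum is taken to be a plain XOR across the words of a chunk, and the names (make_chunk, scrub) are made up for illustration rather than anything from the original system.

```python
from functools import reduce
from operator import xor


def parity(word: int) -> int:
    """Return the parity bit (0 or 1) of an integer word."""
    return bin(word).count("1") & 1


def make_chunk(words):
    """Protect a read-only chunk with per-word parity and one longitudinal XOR sum."""
    words = list(words)
    return {
        "words": words,
        "parity": [parity(w) for w in words],  # stands in for hardware parity
        "lsum": reduce(xor, words, 0),         # assumed form of the longitudinal sum
    }


def scrub(chunk):
    """Check one chunk and repair it if exactly one word fails parity.

    Returns "ok", "fixed", or "uncorrectable".
    """
    words = chunk["words"]
    bad = [i for i, w in enumerate(words) if parity(w) != chunk["parity"][i]]
    if not bad:
        # Parity misses even numbers of flipped bits, so cross-check the sum too.
        return "ok" if reduce(xor, words, 0) == chunk["lsum"] else "uncorrectable"
    if len(bad) > 1:
        return "uncorrectable"
    # XOR of the stored sum with every good word reconstructs the bad word.
    i = bad[0]
    others = reduce(xor, (w for j, w in enumerate(words) if j != i), 0)
    candidate = chunk["lsum"] ^ others
    if parity(candidate) != chunk["parity"][i]:
        return "uncorrectable"  # the sum itself (or another word) is damaged
    words[i] = candidate
    return "fixed"
```

In this model a periodic timer tick would call scrub() on one chunk at a time; flipping a single bit in one word and scrubbing restores the original contents, while damage to two words in the same chunk is reported as uncorrectable so a higher level can decide what to restart.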
- Brendan