By: Brendan (btrotter.delete@this.gmail.com), July 14, 2021 4:55 pm
Room: Moderated Discussions
Hi,
dmcq (dmcq.delete@this.fano.co.uk) on July 14, 2021 4:27 pm wrote:
> Brendan (btrotter.delete@this.gmail.com) on July 14, 2021 12:50 pm wrote:
> > dmcq (dmcq.delete@this.fano.co.uk) on July 14, 2021 10:13 am wrote:
> > > Actually I did write an operating system myself from scratch complete with multiple virtual
> > > machines about thirty years ago. The most difficult error handling was dealing with memory
> > > errors as we expected a couple a week to occur in the various boards using it.
> >
> > I apologize for completely changing the topic; but I've been researching/thinking about/"theorizing
> > about" memory error tolerance (on and off) for about 10 years now - mostly revolving around
> > the obvious "software ECC on top of paging" approach (as described in papers like this one:
> > https://repository.lib.ncsu.edu/bitstream/handle/1840.20/33537/TR-2016-7.pdf ).
> >
> > Are there any interesting insights and/or details you'd be willing to share regarding your experiences?
> >
> > - Brendan
>
> Nothing particularly wonderful. The memory had parity but not hamming - the problem would have been much less
> worrying with hamming but that's what saving money does to you. Tasks for the board could be discarded and restarted
> easily if it said there was an error which made things easier. On the other hand if it didn't recover in a halfway
> decent manner from a failure the host would be unable to restart the board after timing it out.
>
> As to the handling, chunks of read only areas were covered by longitutinal sums and were checked incrementally
> in the watchdog timer and fixed if there there was only one error per chunk. Errors could also be
> detected whilst running or in IO. Read only errors would be fixed. If an error occurred in an unused
> area including parts of buffers which weren't full the error was just fixed. Otherwise errors in writable
> areas which could be ascribed to a particular VM caused an error to be returned for the task, otherwise
> the board said there was a general error and everything was restarted. The handler for the errors
> had a little bit duplicated and provided one got past an initial little part an error in either half
> would be handled by going down the other path. Stuck errors were handled using virtual memory. Details
> would be returned about where errors happened rather like a disk.
Thanks - it definitely sounds like an interesting piece of engineering (and made me think more about providing feedback higher levels could use to further manage uncorrected errors). :-)
- Brendan
dmcq (dmcq.delete@this.fano.co.uk) on July 14, 2021 4:27 pm wrote:
> Brendan (btrotter.delete@this.gmail.com) on July 14, 2021 12:50 pm wrote:
> > dmcq (dmcq.delete@this.fano.co.uk) on July 14, 2021 10:13 am wrote:
> > > Actually I did write an operating system myself from scratch complete with multiple virtual
> > > machines about thirty years ago. The most difficult error handling was dealing with memory
> > > errors as we expected a couple a week to occur in the various boards using it.
> >
> > I apologize for completely changing the topic; but I've been researching/thinking about/"theorizing
> > about" memory error tolerance (on and off) for about 10 years now - mostly revolving around
> > the obvious "software ECC on top of paging" approach (as described in papers like this one:
> > https://repository.lib.ncsu.edu/bitstream/handle/1840.20/33537/TR-2016-7.pdf ).
> >
> > Are there any interesting insights and/or details you'd be willing to share regarding your experiences?
> >
> > - Brendan
>
> Nothing particularly wonderful. The memory had parity but not hamming - the problem would have been much less
> worrying with hamming but that's what saving money does to you. Tasks for the board could be discarded and restarted
> easily if it said there was an error which made things easier. On the other hand if it didn't recover in a halfway
> decent manner from a failure the host would be unable to restart the board after timing it out.
>
> As to the handling, chunks of read only areas were covered by longitutinal sums and were checked incrementally
> in the watchdog timer and fixed if there there was only one error per chunk. Errors could also be
> detected whilst running or in IO. Read only errors would be fixed. If an error occurred in an unused
> area including parts of buffers which weren't full the error was just fixed. Otherwise errors in writable
> areas which could be ascribed to a particular VM caused an error to be returned for the task, otherwise
> the board said there was a general error and everything was restarted. The handler for the errors
> had a little bit duplicated and provided one got past an initial little part an error in either half
> would be handled by going down the other path. Stuck errors were handled using virtual memory. Details
> would be returned about where errors happened rather like a disk.
Thanks - it definitely sounds like an interesting piece of engineering (and made me think more about providing feedback higher levels could use to further manage uncorrected errors). :-)
- Brendan
Topic | Posted By | Date |
---|---|---|
Is unsafe hell truly good for linux kernel in the future? | cqwrteur | 2021/07/09 09:56 PM |
Is unsafe hell truly good for linux kernel in the future? | Brendan | 2021/07/10 12:59 AM |
Is unsafe hell truly good for linux kernel in the future? | cqwrteur | 2021/07/10 01:37 PM |
Is unsafe hell truly good for linux kernel in the future? | anon | 2021/07/10 04:14 AM |
Is unsafe hell truly good for linux kernel in the future? | cqwrteur | 2021/07/10 01:40 PM |
Is unsafe hell truly good for linux kernel in the future? | Gabriele Svelto | 2021/07/10 03:59 PM |
Is unsafe hell truly good for linux kernel in the future? | cqwrteur | 2021/07/10 04:42 PM |
Is unsafe hell truly good for linux kernel in the future? | anon | 2021/07/11 06:11 AM |
Is unsafe hell truly good for linux kernel in the future? | cqwrteur | 2021/07/12 12:40 PM |
Is unsafe hell truly good for linux kernel in the future? | Foo_ | 2021/07/10 06:56 AM |
Is unsafe hell truly good for linux kernel in the future? | cqwrteur | 2021/07/10 09:59 AM |
Most RWT posters don’t decide what goes into the Linux kernel | Mark Roulo | 2021/07/10 12:55 PM |
Is unsafe hell truly good for linux kernel in the future? | Foo_ | 2021/07/22 11:10 AM |
Is unsafe hell truly good for linux kernel in the future? | cqwrteur | 2021/07/10 10:22 AM |
Is unsafe hell truly good for linux kernel in the future? | cqwrteur | 2021/07/10 10:24 AM |
Déja Vu | Dismissive | 2021/07/10 10:41 AM |
Déja Vu | cqwrteur | 2021/07/10 10:47 AM |
Déja Vu | Dismissive | 2021/07/10 10:51 AM |
Déja Vu | Michael S | 2021/07/10 01:11 PM |
Is unsafe hell truly good for linux kernel in the future? | Gabriele Svelto | 2021/07/10 12:51 PM |
Is unsafe hell truly good for linux kernel in the future? | cqwrteur | 2021/07/10 01:32 PM |
Is unsafe hell truly good for linux kernel in the future? | Michael S | 2021/07/10 02:04 PM |
Is unsafe hell truly good for linux kernel in the future? | cqwrteur | 2021/07/10 02:25 PM |
Is unsafe hell truly good for linux kernel in the future? | Gabriele Svelto | 2021/07/10 03:56 PM |
Is unsafe hell truly good for linux kernel in the future? | cqwrteur | 2021/07/10 04:41 PM |
Is unsafe hell truly good for linux kernel in the future? | Rayla | 2021/07/10 05:33 PM |
Is unsafe hell truly good for linux kernel in the future? | cqwrteur | 2021/07/10 06:27 PM |
Interesting response... (NT) | Rayla | 2021/07/10 09:02 PM |
perhaps just another lousy AI bot? (NT) | anonymou5 | 2021/07/10 09:33 PM |
perhaps just another lousy AI bot? | dmcq | 2021/07/10 11:26 PM |
perhaps just another lousy AI bot? | cqwrteur | 2021/07/10 11:56 PM |
perhaps just another lousy AI bot? | dmcq | 2021/07/11 03:29 AM |
perhaps just another lousy AI bot? | anon | 2021/07/11 06:16 AM |
perhaps just another lousy AI bot? | cqwrteur | 2021/07/12 03:56 PM |
perhaps just another lousy AI bot? | Rayla | 2021/07/11 06:13 AM |
perhaps just another lousy AI bot? | cqwrteur | 2021/07/11 11:59 AM |
When did I call you a bot, Kebabbert? (NT) | Rayla | 2021/07/11 08:51 PM |
Alternatives? | Brendan | 2021/07/11 01:54 AM |
Alternatives? | Michael S | 2021/07/11 06:01 AM |
Alternatives? | Brendan | 2021/07/11 06:51 AM |
Alternatives? | cqwrteur | 2021/07/11 11:58 AM |
Alternatives? | Gabriele Svelto | 2021/07/12 01:31 AM |
Alternatives? | Michael S | 2021/07/12 03:58 AM |
Alternatives? | anon2 | 2021/07/12 09:08 AM |
Alternatives? | Michael S | 2021/07/12 09:22 AM |
cqwrteur: Keep it polite | David Kanter | 2021/07/13 08:59 AM |
Alternatives? | dmcq | 2021/07/12 09:37 AM |
Alternatives? | cqwrteur | 2021/07/12 04:04 PM |
Alternatives? | dmcq | 2021/07/12 04:26 PM |
Alternatives? | cqwrteur | 2021/07/13 01:47 AM |
Alternatives? | dmcq | 2021/07/13 06:54 AM |
Alternatives? | Jörn Engel | 2021/07/13 04:53 PM |
Alternatives? | FrankHB | 2021/07/17 07:56 AM |
Differences between Rust and C/Go | Gabriele Svelto | 2021/07/14 05:57 AM |
Differences between Rust and C/Go | FrankHB | 2021/07/17 09:47 AM |
Alternatives? | FrankHB | 2021/07/12 10:08 AM |
Alternatives? | Gabriele Svelto | 2021/07/14 02:28 PM |
Inappropriate messages removed: cqwrteur | David Kanter | 2021/07/15 10:59 AM |
Alternatives? | FrankHB | 2021/07/16 06:43 AM |
Alternatives? | Anon | 2021/07/16 12:01 PM |
Alternatives? | Gabriele Svelto | 2021/07/16 01:44 PM |
Type abstraction and kernel programming | FrankHB | 2021/07/17 01:44 AM |
Type abstraction and kernel programming | dmcq | 2021/07/18 04:00 AM |
Type abstraction and kernel programming | dmcq | 2021/07/18 04:36 AM |
Type abstraction and kernel programming | Etienne Lorrain | 2021/07/19 01:03 AM |
Type abstraction and kernel programming | dmcq | 2021/07/19 02:01 AM |
Type abstraction and kernel programming | Anon | 2021/07/19 02:05 AM |
Type abstraction and kernel programming | dmcq | 2021/07/19 03:23 AM |
Type abstraction and kernel programming | Brendan | 2021/07/19 07:05 AM |
Alternatives? | gallier2 | 2021/07/20 04:57 AM |
Alternatives? | Anon | 2021/07/20 06:24 AM |
Alternatives? | Michael S | 2021/07/20 10:14 AM |
Alternatives? | Anon | 2021/07/20 10:53 AM |
Alternatives? | gallier2 | 2021/07/21 11:44 PM |
Alternatives? | Adrian | 2021/07/20 12:00 PM |
Alternatives? | Brett | 2021/07/20 11:13 PM |
Alternatives? | Michael S | 2021/07/21 02:12 AM |
Alternatives? | dmcq | 2021/07/22 12:58 PM |
Alternatives? | Anon | 2021/07/21 08:58 AM |
Alternatives? | Brendan | 2021/07/12 02:34 AM |
Alternatives? | FrankHB | 2021/07/12 10:57 AM |
Alternatives? | cqwrteur | 2021/07/12 12:55 PM |
Alternatives? | FrankHB | 2021/07/12 09:44 PM |
Alternatives? | Brendan | 2021/07/12 08:52 PM |
Alternatives? | cqwrteur | 2021/07/12 11:05 PM |
Alternatives? | Anon | 2021/07/12 11:42 PM |
Alternatives? | cqwrteur | 2021/07/13 12:42 AM |
Alternatives? | cqwrteur | 2021/07/13 12:44 AM |
Alternatives? | Anon | 2021/07/13 08:32 PM |
Alternatives? | cqwrteur | 2021/07/13 09:36 PM |
Alternatives? | cqwrteur | 2021/07/13 09:39 PM |
Alternatives? | Anon | 2021/07/13 10:02 PM |
Alternatives? | cqwrteur | 2021/07/13 10:18 PM |
Alternatives? | cqwrteur | 2021/07/13 09:49 PM |
Alternatives? | Anon | 2021/07/13 10:07 PM |
Alternatives? | cqwrteur | 2021/07/13 10:16 PM |
Alternatives? | Anon | 2021/07/13 11:31 PM |
Alternatives? | cqwrteur | 2021/07/14 12:30 AM |
Alternatives? | Anon | 2021/07/14 01:55 AM |
Alternatives? | cqwrteur | 2021/07/14 02:22 AM |
Alternatives? | Anon | 2021/07/14 03:05 AM |
Alternatives? | cqwrteur | 2021/07/14 03:11 AM |
Alternatives? | Anon | 2021/07/14 04:16 AM |
Alternatives? | cqwrteur | 2021/07/14 07:06 AM |
Alternatives? | Anon | 2021/07/14 08:20 AM |
Alternatives? | cqwrteur | 2021/07/14 08:51 AM |
Alternatives? | Anon | 2021/07/14 12:33 PM |
Alternatives? | Gabriele Svelto | 2021/07/14 01:19 PM |
Alternatives? | FrankHB | 2021/07/16 07:07 AM |
Alternatives? | cqwrteur | 2021/07/14 12:33 AM |
Alternatives? | Anon | 2021/07/14 01:57 AM |
Alternatives? | cqwrteur | 2021/07/14 02:21 AM |
Alternatives? | dmcq | 2021/07/14 03:06 AM |
Alternatives? | cqwrteur | 2021/07/14 03:50 AM |
Alternatives? | ⚛ | 2021/07/15 08:33 AM |
Alternatives? | FrankHB | 2021/07/16 07:13 AM |
Alternatives? | cqwrteur | 2021/07/14 12:39 AM |
Alternatives? | Anon | 2021/07/14 02:08 AM |
Alternatives? | cqwrteur | 2021/07/14 02:20 AM |
Alternatives? | dmcq | 2021/07/14 02:46 AM |
Alternatives? | cqwrteur | 2021/07/14 02:52 AM |
Alternatives? | dmcq | 2021/07/14 10:13 AM |
Alternatives? | dmcq | 2021/07/14 10:23 AM |
Dealing with memory errors | Brendan | 2021/07/14 12:50 PM |
Dealing with memory errors | dmcq | 2021/07/14 04:27 PM |
Dealing with memory errors | Brendan | 2021/07/14 04:55 PM |
Alternatives? | cqwrteur | 2021/07/14 03:12 AM |
Alternatives? | Anon | 2021/07/14 04:16 AM |
Alternatives? | cqwrteur | 2021/07/14 06:55 AM |
Alternatives? | FrankHB | 2021/07/16 07:27 AM |
Alternatives? | cqwrteur | 2021/07/14 02:38 AM |
Alternatives? | anon | 2021/07/14 03:50 AM |
Stop feeding that troll | none | 2021/07/14 04:13 AM |
Alternatives? | cqwrteur | 2021/07/14 07:39 AM |
Alternatives? | Brendan | 2021/07/14 12:15 PM |
Alternatives? | Anon | 2021/07/14 04:19 AM |
Alternatives? | cqwrteur | 2021/07/14 07:12 AM |
Alternatives? | Anon | 2021/07/14 08:17 AM |
Alternatives? | cqwrteur | 2021/07/14 08:47 AM |
Alternatives? | Anon | 2021/07/14 01:00 PM |
Alternatives? | cqwrteur | 2021/07/14 01:44 PM |
Alternatives? | ⚛ | 2021/07/15 10:36 AM |
Alternatives? | Gabriele Svelto | 2021/07/14 01:26 PM |
Alternatives? | cqwrteur | 2021/07/14 01:46 PM |
Alternatives? | Gabriele Svelto | 2021/07/14 02:36 PM |
Alternatives? | cqwrteur | 2021/07/14 02:55 PM |
Alternatives? | Smoochie | 2021/07/15 12:07 AM |
Alternatives? | ⚛ | 2021/07/15 08:37 AM |
Alternatives? | Brendan | 2021/07/15 11:21 AM |
Alternatives? | Anon | 2021/07/15 01:15 PM |
Alternatives? | FrankHB | 2021/07/16 07:27 AM |
Alternatives? | None | 2021/07/14 02:50 AM |
Alternatives? | cqwrteur | 2021/07/14 02:54 AM |
Alternatives? | cqwrteur | 2021/07/14 02:55 AM |
Alternatives? | Rayla | 2021/07/14 05:47 AM |
Alternatives? | cqwrteur | 2021/07/14 06:54 AM |
Alternatives? | Gabriele Svelto | 2021/07/14 01:43 PM |
Alternatives? | FrankHB | 2021/07/13 12:47 AM |
Alternatives? | FrankHB | 2021/07/13 12:05 AM |
Alternatives? | Michael S | 2021/07/13 01:01 AM |
Alternatives? | FrankHB | 2021/07/13 01:25 AM |
Alternatives? | Doug S | 2021/07/13 12:29 AM |
Alternatives? | cqwrteur | 2021/07/13 12:48 AM |
Alternatives? | FrankHB | 2021/07/13 01:07 AM |
Is unsafe hell truly good for linux kernel in the future? | ⚛ | 2021/07/12 06:27 AM |
Is unsafe hell truly good for linux kernel in the future? | Anon | 2021/07/12 09:46 AM |
Is unsafe hell truly good for linux kernel in the future? | Etienne Lorrain | 2021/07/13 02:00 AM |
Is unsafe hell truly good for linux kernel in the future? | cqwrteur | 2021/07/10 01:38 PM |