By: dmcq (dmcq.delete@this.fano.co.uk), July 14, 2021 4:27 pm
Room: Moderated Discussions
Brendan (btrotter.delete@this.gmail.com) on July 14, 2021 12:50 pm wrote:
> Hi,
>
> dmcq (dmcq.delete@this.fano.co.uk) on July 14, 2021 10:13 am wrote:
> > Actually I did write an operating system myself from scratch complete with multiple virtual
> > machines about thirty years ago. The most difficult error handling was dealing with memory
> > errors as we expected a couple a week to occur in the various boards using it.
>
> I apologize for completely changing the topic; but I've been researching/thinking about/"theorizing
> about" memory error tolerance (on and off) for about 10 years now - mostly revolving around
> the obvious "software ECC on top of paging" approach (as described in papers like this one:
> https://repository.lib.ncsu.edu/bitstream/handle/1840.20/33537/TR-2016-7.pdf ).
>
> Are there any interesting insights and/or details you'd be willing to share regarding your experiences?
>
> - Brendan
Nothing particularly wonderful. The memory had parity but not Hamming codes - the problem would have been much less worrying with Hamming, but that's what saving money does to you. Tasks for the board could be discarded and restarted easily if it reported an error, which made things easier. On the other hand, if the board didn't recover in a halfway decent manner from a failure, the host would be unable to restart it after timing it out.
As to the handling, chunks of read-only areas were covered by longitudinal sums and were checked incrementally in the watchdog timer and fixed if there was only one error per chunk. Errors could also be detected whilst running or during I/O. Read-only errors would be fixed. If an error occurred in an unused area, including parts of buffers which weren't full, the error was just fixed. Otherwise, errors in writable areas which could be ascribed to a particular VM caused an error to be returned for the task; failing that, the board reported a general error and everything was restarted. The handler for the errors had a little bit duplicated, and provided one got past an initial little part, an error in either half would be handled by going down the other path. Stuck errors were handled using virtual memory. Details would be returned about where errors happened, rather like a disk.
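For anyone who wants to see the read-only scrubbing idea concretely, here is a rough sketch in Python. It is not the original code. It assumes the "longitudinal sum" is an XOR across all words in a chunk and that per-word parity tells you which word went bad; the names `parity`, `longitudinal_sum` and `scrub_chunk` are made up for illustration.

```python
from functools import reduce
from operator import xor


def parity(word: int) -> int:
    """Parity bit of a word: 1 if it has an odd number of set bits."""
    return bin(word).count("1") & 1


def longitudinal_sum(words) -> int:
    """XOR of every word in a chunk (assumed form of the longitudinal sum)."""
    return reduce(xor, words, 0)


def scrub_chunk(words, parities, stored_sum) -> str:
    """Check one chunk in place; return 'ok', 'fixed' or 'uncorrectable'.

    words      -- mutable list of words as currently read from memory
    parities   -- parity bit recorded for each word when it was written
    stored_sum -- longitudinal sum recorded when the chunk was written
    """
    # Per-word parity locates the damaged word(s).
    bad = [i for i, w in enumerate(words) if parity(w) != parities[i]]
    if not bad:
        # Parity is clean; a sum mismatch here would mean an even number
        # of bits flipped in one word, which parity alone cannot see.
        return "ok" if longitudinal_sum(words) == stored_sum else "uncorrectable"
    if len(bad) > 1:
        # More than one damaged word per chunk cannot be rebuilt.
        return "uncorrectable"
    # Exactly one damaged word: rebuild it from the sum and the other words.
    i = bad[0]
    others = longitudinal_sum(w for j, w in enumerate(words) if j != i)
    words[i] = stored_sum ^ others
    return "fixed"


# Usage: flip one bit in one word, then let the scrubber repair it.
original = [0x12345678, 0xDEADBEEF, 0x00000000, 0xFFFFFFFF]
parities = [parity(w) for w in original]
stored = longitudinal_sum(original)

damaged = list(original)
damaged[1] ^= 1 << 7
assert scrub_chunk(damaged, parities, stored) == "fixed"
assert damaged == original
```

In a real system the check would be spread over many watchdog ticks, one chunk at a time, rather than run in a single pass. The sketch also only covers the read-only case; writable areas have no trusted sum to rebuild from, which is why those errors were reported instead of fixed.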