By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), August 21, 2018 2:31 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on August 21, 2018 2:34 pm wrote:
>
> I Again, rseq is designed for per-cpu lockless algorithms. The main use case is
> literally a percpu memory allocator that doesn't need to take locks. There are others, but
> memory allocators can be so critical that you might as well see that as the primary one.
Actually, another thing worth mentioning is just "statistics".
It sounds ridiculously easy to do things like "count number of packets processed" or "count hash misses vs hits", but it turns out that in heavily threaded environments, doing things like operation statistics is sometimes some of the nastiest stuff there is.
It's trivial to do per-thread: just increment the counter in the per-thread area. Boom. Done. There are no races, because even on the worst kind of load-store machine you can just do
and you're done. No locks, no cache conflicts with other threads, no nothing.
The problem with the per-thread model is that it can be very expensive indeed to add all those per-thread statistics up. Thousands of threads is not even unusual. So maybe you can deal with some slop, and just have the threads periodically empoty their per-thread statistics into some cache-coherent pool or something.
It's solvable, don't get me wrong. But it has problems.
So you can solve it other ways. One way is to have the remote atomics, and just do the statistics update "somewhere else" and asynchronously. That's a lovely model in many ways, but it's not portable.
Or you can just do actual cache-coherent locked loads to a global counter. That works fine for most cases, but it gets to be a complete disaster for the really hot counters on big machines. You'll potentially spend more time doing the statistics than you spend on the work you did.
Or you can play games. You could reserve one register for a "percpu base" - the same way you already have a base pointer for the thread local area, and rely on the OS updating it, and if you have a locally atomic (ie not SMP) instruction to do a read-modify-write, you can just do a single instruction
and you're fine. On x86 you could use a segment register for the percpu base, the same way you use a segment register for the thread-local storage, and you're done.
With the above you get great cache behavior, no cross-cpu migration at all, and much better scaling too - you can add up the statistics by just iterating over the number of CPU's. Sure, that might still be a reasonably big number on a studly machine, but it's a much better proposition than the per-thread case.
But the above already requires that OS support for per-cpu base updates as you move around, and it also requires that you do have that single-instruction read-modify-write. So in practice, you can't actually do it, because it is so unportable.
So this is the kind of thing rseq can help with. A single memory add to update statistics - it should be really trivial and easy, but it actually is anything but. It's complex enough that people don't even do it at all.
With rseq, you can do basically the above "portably" (you have to have rseq) with two normal reads and write instruciton. Regular reads and writes too, not some slow cache-atomic ones.
There's a trivial (again normal instrucvtion) sequence to start the critical section (to a magic location you had registered with the kernel earlier), one read of the percpu base value, one read of the actual counter value, and one write of the new counter value.
If you got scheduled away, you get aborted and restart the sequence (otr maybe you decide to fall back to cacheline atomic, that's probably the better model anyway)
That kind of silly low-level thing that sounds trivial - but really isn't - is what rseq is about. Things that sound so simple that you'd not even realize that they are hard to implement.
So per-cpu memory allocators, trace buffers, counters. Things that sound obvious and trivial, but are pretty basic and aren't trivial at all in user space, and where locking is much too expensive, and transactional memory doesn't even help.
Linus
>
> I Again, rseq is designed for per-cpu lockless algorithms. The main use case is
> literally a percpu memory allocator that doesn't need to take locks. There are others, but
> memory allocators can be so critical that you might as well see that as the primary one.
Actually, another thing worth mentioning is just "statistics".
It sounds ridiculously easy to do things like "count number of packets processed" or "count hash misses vs hits", but it turns out that in heavily threaded environments, doing things like operation statistics is sometimes some of the nastiest stuff there is.
It's trivial to do per-thread: just increment the counter in the per-thread area. Boom. Done. There are no races, because even on the worst kind of load-store machine you can just do
get thread pointer
load counter from memory off thread pointer
increment counter
store new counter off thread pointer
and you're done. No locks, no cache conflicts with other threads, no nothing.
The problem with the per-thread model is that it can be very expensive indeed to add all those per-thread statistics up. Thousands of threads is not even unusual. So maybe you can deal with some slop, and just have the threads periodically empoty their per-thread statistics into some cache-coherent pool or something.
It's solvable, don't get me wrong. But it has problems.
So you can solve it other ways. One way is to have the remote atomics, and just do the statistics update "somewhere else" and asynchronously. That's a lovely model in many ways, but it's not portable.
Or you can just do actual cache-coherent locked loads to a global counter. That works fine for most cases, but it gets to be a complete disaster for the really hot counters on big machines. You'll potentially spend more time doing the statistics than you spend on the work you did.
Or you can play games. You could reserve one register for a "percpu base" - the same way you already have a base pointer for the thread local area, and rely on the OS updating it, and if you have a locally atomic (ie not SMP) instruction to do a read-modify-write, you can just do a single instruction
increment statistics(%cpubase)
and you're fine. On x86 you could use a segment register for the percpu base, the same way you use a segment register for the thread-local storage, and you're done.
With the above you get great cache behavior, no cross-cpu migration at all, and much better scaling too - you can add up the statistics by just iterating over the number of CPU's. Sure, that might still be a reasonably big number on a studly machine, but it's a much better proposition than the per-thread case.
But the above already requires that OS support for per-cpu base updates as you move around, and it also requires that you do have that single-instruction read-modify-write. So in practice, you can't actually do it, because it is so unportable.
So this is the kind of thing rseq can help with. A single memory add to update statistics - it should be really trivial and easy, but it actually is anything but. It's complex enough that people don't even do it at all.
With rseq, you can do basically the above "portably" (you have to have rseq) with two normal reads and write instruciton. Regular reads and writes too, not some slow cache-atomic ones.
There's a trivial (again normal instrucvtion) sequence to start the critical section (to a magic location you had registered with the kernel earlier), one read of the percpu base value, one read of the actual counter value, and one write of the new counter value.
If you got scheduled away, you get aborted and restart the sequence (otr maybe you decide to fall back to cacheline atomic, that's probably the better model anyway)
That kind of silly low-level thing that sounds trivial - but really isn't - is what rseq is about. Things that sound so simple that you'd not even realize that they are hard to implement.
So per-cpu memory allocators, trace buffers, counters. Things that sound obvious and trivial, but are pretty basic and aren't trivial at all in user space, and where locking is much too expensive, and transactional memory doesn't even help.
Linus
Topic | Posted By | Date |
---|---|---|
ARM turns to a god and a hero | AM | 2018/08/16 08:32 AM |
ARM turns to a god and a hero | Maynard Handley | 2018/08/16 08:41 AM |
ARM turns to a god and a hero | Doug S | 2018/08/16 10:11 AM |
ARM turns to a god and a hero | Geoff Langdale | 2018/08/16 10:59 PM |
ARM turns to a god and a hero | dmcq | 2018/08/17 04:12 AM |
ARM is somewhat misleading | Adrian | 2018/08/16 10:56 PM |
It's marketing material | Gabriele Svelto | 2018/08/17 12:00 AM |
It's marketing material | Michael S | 2018/08/17 02:13 AM |
It's marketing material | dmcq | 2018/08/17 04:23 AM |
It's marketing material | Andrei Frumusanu | 2018/08/17 06:25 AM |
It's marketing material | Linus Torvalds | 2018/08/17 10:20 AM |
It's marketing material | Groo | 2018/08/17 12:44 PM |
It's marketing material | Doug S | 2018/08/17 01:14 PM |
promises and deliveries | AM | 2018/08/17 01:32 PM |
promises and deliveries | Passing Through | 2018/08/17 02:02 PM |
Just by way of clarification | Passing Through | 2018/08/17 02:15 PM |
Just by way of clarification | AM | 2018/08/18 11:49 AM |
Just by way of clarification | Passing Through | 2018/08/18 12:34 PM |
This ain't the nineties any longer | Passing Through | 2018/08/18 12:54 PM |
This ain't the nineties any longer | Maynard Handley | 2018/08/18 01:50 PM |
This ain't the nineties any longer | Passing Through | 2018/08/18 02:57 PM |
This ain't the nineties any longer | Passing Through | 2018/09/06 01:42 PM |
This ain't the nineties any longer | Maynard Handley | 2018/09/07 03:10 PM |
This ain't the nineties any longer | Passing Through | 2018/09/07 03:48 PM |
This ain't the nineties any longer | Maynard Handley | 2018/09/07 04:22 PM |
Just by way of clarification | Wilco | 2018/08/18 12:26 PM |
Just by way of clarification | Passing Through | 2018/08/18 12:39 PM |
Just by way of clarification | none | 2018/08/18 09:52 PM |
Just by way of clarification | dmcq | 2018/08/19 07:32 AM |
Just by way of clarification | none | 2018/08/19 07:54 AM |
Just by way of clarification | dmcq | 2018/08/19 10:24 AM |
Just by way of clarification | none | 2018/08/19 10:52 AM |
Just by way of clarification | Gabriele Svelto | 2018/08/19 05:41 AM |
Just by way of clarification | Passing Through | 2018/08/19 08:25 AM |
Whiteboards at Gatwick airport anyone? | Passing Through | 2018/08/20 03:24 AM |
It's marketing material | Michael S | 2018/08/18 10:12 AM |
It's marketing material | Brett | 2018/08/18 04:22 PM |
It's marketing material | Brett | 2018/08/18 04:33 PM |
It's marketing material | Adrian | 2018/08/19 12:21 AM |
A76 | AM | 2018/08/17 01:45 PM |
A76 | Michael S | 2018/08/18 10:20 AM |
A76 | AM | 2018/08/18 11:39 AM |
A76 | Michael S | 2018/08/18 11:49 AM |
A76 | AM | 2018/08/18 12:06 PM |
A76 | Doug S | 2018/08/18 12:43 PM |
A76 | Maynard Handley | 2018/08/18 01:42 PM |
A76 | Maynard Handley | 2018/08/18 03:22 PM |
Why write zeros when one can use metadata? | Paul A. Clayton | 2018/08/18 05:19 PM |
Why write zeros when one can use metadata? | Maynard Handley | 2018/08/19 10:12 AM |
Dictionary compress might apply to memcopy | Paul A. Clayton | 2018/08/19 12:45 PM |
Instructions for zeroing | Konrad Schwarz | 2018/08/30 05:37 AM |
Instructions for zeroing | Maynard Handley | 2018/08/30 07:41 AM |
Instructions for zeroing | Adrian | 2018/08/30 10:37 AM |
dcbz -> dcbzl (was: Instructions for zeroing) | hobold | 2018/08/31 12:50 AM |
dcbz -> dcbzl (was: Instructions for zeroing) | dmcq | 2018/09/01 04:28 AM |
A76 | Travis | 2018/08/19 10:36 AM |
A76 | Maynard Handley | 2018/08/19 11:22 AM |
A76 | Travis | 2018/08/19 01:07 PM |
A76 | Maynard Handley | 2018/08/19 05:24 PM |
Remote atomics | matthew | 2018/08/19 11:51 AM |
Remote atomics | Michael S | 2018/08/19 12:58 PM |
Remote atomics | matthew | 2018/08/19 01:32 PM |
Remote atomics | Michael S | 2018/08/19 01:36 PM |
Remote atomics | matthew | 2018/08/19 01:48 PM |
Remote atomics | Michael S | 2018/08/19 02:16 PM |
Remote atomics | Ricardo B | 2018/08/20 09:05 AM |
Remote atomics | dmcq | 2018/08/19 01:33 PM |
Remote atomics | Travis | 2018/08/19 01:32 PM |
Remote atomics | Michael S | 2018/08/19 01:46 PM |
Remote atomics | Travis | 2018/08/19 04:35 PM |
Remote atomics | Michael S | 2018/08/20 02:29 AM |
Remote atomics | matthew | 2018/08/19 06:58 PM |
Remote atomics | anon | 2018/08/19 11:59 PM |
Remote atomics | Travis | 2018/08/20 09:26 AM |
Remote atomics | Travis | 2018/08/20 08:57 AM |
Remote atomics | Linus Torvalds | 2018/08/20 03:29 PM |
Fitting time slices to execution phases | Paul A. Clayton | 2018/08/21 08:09 AM |
Fitting time slices to execution phases | Linus Torvalds | 2018/08/21 01:34 PM |
Fitting time slices to execution phases | Linus Torvalds | 2018/08/21 02:31 PM |
Fitting time slices to execution phases | Gabriele Svelto | 2018/08/21 02:54 PM |
Fitting time slices to execution phases | Linus Torvalds | 2018/08/21 03:26 PM |
Fitting time slices to execution phases | Travis | 2018/08/21 03:21 PM |
Fitting time slices to execution phases | Linus Torvalds | 2018/08/21 03:39 PM |
Fitting time slices to execution phases | Travis | 2018/08/21 03:59 PM |
Fitting time slices to execution phases | Linus Torvalds | 2018/08/21 04:13 PM |
Fitting time slices to execution phases | anon | 2018/08/21 03:27 PM |
Fitting time slices to execution phases | Linus Torvalds | 2018/08/21 05:02 PM |
Fitting time slices to execution phases | Etienne | 2018/08/22 01:28 AM |
Fitting time slices to execution phases | Gabriele Svelto | 2018/08/22 02:07 PM |
Fitting time slices to execution phases | Travis | 2018/08/22 03:00 PM |
Fitting time slices to execution phases | anon | 2018/08/22 05:52 PM |
Fitting time slices to execution phases | Travis | 2018/08/21 03:37 PM |
Is preventing misuse that complex? | Paul A. Clayton | 2018/08/23 04:42 AM |
Is preventing misuse that complex? | Linus Torvalds | 2018/08/23 11:46 AM |
Is preventing misuse that complex? | Travis | 2018/08/23 12:29 PM |
Is preventing misuse that complex? | Travis | 2018/08/23 12:33 PM |
Is preventing misuse that complex? | Jeff S. | 2018/08/24 06:57 AM |
Is preventing misuse that complex? | Travis | 2018/08/24 07:47 AM |
Is preventing misuse that complex? | Linus Torvalds | 2018/08/23 01:30 PM |
Is preventing misuse that complex? | Travis | 2018/08/23 02:11 PM |
Is preventing misuse that complex? | Linus Torvalds | 2018/08/24 12:00 PM |
Is preventing misuse that complex? | Gabriele Svelto | 2018/08/24 12:25 PM |
Is preventing misuse that complex? | Linus Torvalds | 2018/08/24 12:33 PM |
Fitting time slices to execution phases | Travis | 2018/08/21 02:54 PM |
rseq: holy grail rwlock? | Travis | 2018/08/21 02:18 PM |
rseq: holy grail rwlock? | Linus Torvalds | 2018/08/21 02:59 PM |
rseq: holy grail rwlock? | Travis | 2018/08/21 03:27 PM |
rseq: holy grail rwlock? | Linus Torvalds | 2018/08/21 04:10 PM |
rseq: holy grail rwlock? | Travis | 2018/08/21 05:21 PM |
ARM design houses | Michael S | 2018/08/21 04:07 AM |
ARM design houses | Wilco | 2018/08/22 11:38 AM |
ARM design houses | Michael S | 2018/08/22 01:21 PM |
ARM design houses | Wilco | 2018/08/22 02:23 PM |
ARM design houses | Michael S | 2018/08/29 12:58 AM |
Qualcomm's core naming scheme really, really sucks | Heikki Kultala | 2018/08/29 01:19 AM |
A76 | Maynard Handley | 2018/08/18 01:07 PM |
A76 | Michael S | 2018/08/18 01:32 PM |
A76 | Maynard Handley | 2018/08/18 01:52 PM |
A76 | Michael S | 2018/08/18 02:04 PM |
ARM is somewhat misleading | juanrga | 2018/08/17 12:20 AM |
Surprised?? | Alberto | 2018/08/17 12:52 AM |
Surprised?? | Alberto | 2018/08/17 01:10 AM |
Surprised?? | none | 2018/08/17 01:46 AM |
Garbage talk | Andrei Frumusanu | 2018/08/17 06:30 AM |
Garbage talk | Michael S | 2018/08/17 06:43 AM |
Garbage talk | Andrei Frumusanu | 2018/08/17 08:51 AM |
Garbage talk | Michael S | 2018/08/18 10:29 AM |
Garbage talk | Adrian | 2018/08/17 07:28 AM |
Garbage talk | Alberto | 2018/08/17 08:20 AM |
Garbage talk | Andrei Frumusanu | 2018/08/17 08:48 AM |
Garbage talk | Adrian | 2018/08/17 09:17 AM |
Garbage talk | Andrei Frumusanu | 2018/08/17 09:36 AM |
Garbage talk | Adrian | 2018/08/17 01:53 PM |
Garbage talk | Andrei Frumusanu | 2018/08/17 11:17 PM |
More like a religion he?? ARM has an easy life :) | Alberto | 2018/08/17 08:13 AM |
More like a religion he?? ARM has an easy life :) | Andrei Frumusanu | 2018/08/17 08:34 AM |
More like a religion he?? ARM has an easy life :) | Alberto | 2018/08/17 09:03 AM |
More like a religion he?? ARM has an easy life :) | Andrei Frumusanu | 2018/08/17 09:43 AM |
More like a religion he?? ARM has an easy life :) | Doug S | 2018/08/17 01:17 PM |
15W phone SoCs | AM | 2018/08/17 02:04 PM |
More like a religion he?? ARM has an easy life :) | Maynard Handley | 2018/08/17 11:29 AM |
my future stuff will be better than your old stuff, hey I'm a god at last (NT) | Eric Bron | 2018/08/18 02:34 AM |
my future stuff will be better than your old stuff, hey I'm a god at last | none | 2018/08/18 07:34 AM |