By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), August 21, 2018 1:34 pm
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on August 21, 2018 9:09 am wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on August 20, 2018 4:29 pm wrote:
> >
> > Linux now (merged into the latest released kernel version, 4.18) actually has
> > what could be seen as the reverse of that: "rseq" aka restartable sequences.
> >
> > It doesn't disable preemption (which is crazy and all kinds of stupid), but it does
> > allow user space to see if it has been preempted, and mark certain sequences to
> > be done atomically. And if preemption happens, the sequence gets aborted.
>
> Since threads have phases that benefit from not being significantly interrupted, I think there
> would be value to allowing a thread to express that a phase would extend beyond the normally
> allotted time slice
No.
People will just misuse that, and then you need a lot of BS code to prevent misuses etc.
Don't do it. There is no upside anyway.
> > So you can think of rseq as kind of like the OS equivalent of transactional memory, but instead
> > of the transactional sequence being aborted on a cache conflict, it gets aborted on preemption.
>
> Except that the abort happens after the preemption completes. If a thread gets
> a lock and starts work in a critical section, it cannot release the lock (and undo
> no longer appropriate work) when another thread is waiting on the lock.
That's stupid.
If you get a lock, you get a lock. End of story. You don't need any preemption protection.
No, rseq is very specifically for lockless things. Particularly things where you can do percpu work, and then "finalize" the work with a single store update.
Note that that does not mean that you can't do multiple stores. You'd do multiple stores to your own private copy, and then you do the actual lockless part with that atomic percpu "test everything and update".
And note that the "atomic" here is very much about the percpu "I'm not scheduled away" atomic sense. It's not about atomics in the "lock cacheline" sense (ie load-locked/store-conditional or the x86 "lock" prefix).
If you do locked operations, you're already doing things that are more expensive than rseq is designed for. Again, rseq is designed for per-cpu lockless algorithms. The main use case is literally a percpu memory allocator that doesn't need to take locks. There are others, but memory allocators can be so critical that you might as well see that as the primary one.
Maybe I shouldn't have compared it to TSX, because that might have given you a false sense of what the intention is. rseq is literally about the "how do I avoid locking entirely, because locks are too expensive".
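To make the "private stores first, then one commit store" shape concrete, here is a rough sketch of a per-CPU free list of the kind a lockless allocator might keep. It is only a model in plain C: the names (`free_node`, `percpu_freelist`, `freelist_push_try`) are made up for illustration, and the compare-plus-store that a real rseq critical section would run as one restartable unit is ordinary code here, so it is not preemption-safe as written.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical free-list node and per-CPU list head; names are illustrative. */
struct free_node { struct free_node *next; };
struct percpu_freelist { struct free_node *head; };

/*
 * Try to push `n` onto this CPU's list, given an earlier snapshot of the head.
 * Returns 1 if the commit store happened, 0 if the head no longer matches the
 * snapshot and the caller should retry.
 * Assumption: in a real rseq sequence the compare and the final store would be
 * aborted and restarted on preemption; this plain C version does not do that.
 */
static int freelist_push_try(struct percpu_freelist *l, struct free_node *n,
                             struct free_node *snapshot)
{
    n->next = snapshot;        /* private store: nobody else can see n yet */
    if (l->head != snapshot)   /* "test everything" */
        return 0;
    l->head = n;               /* the single commit store */
    return 1;
}

/* Retry loop around the attempt: take a fresh snapshot each time. */
static void freelist_push(struct percpu_freelist *l, struct free_node *n)
{
    assert(n != NULL);
    while (!freelist_push_try(l, n, l->head))
        ;
}

/* Pop follows the same shape: read privately, re-check, commit with one store. */
static struct free_node *freelist_pop(struct percpu_freelist *l)
{
    for (;;) {
        struct free_node *snapshot = l->head;
        if (snapshot == NULL)
            return NULL;
        struct free_node *next = snapshot->next;
        if (l->head != snapshot)
            continue;          /* head moved: start over */
        l->head = next;        /* the single commit store */
        return snapshot;
    }
}
```

The point of the shape is that everything before the last store touches only memory the thread owns, so an abort costs nothing but a retry.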
> (Some of the transactional memory proposals suggest NAKing conflicting remote
> requests or using versioned memory to give a transaction a larger window
> in which to complete, which is similar to extending the time slice.)
Yeah, and that's bullshit too, and causes deadlocks. It adds a *lot* of complexity exactly the same way that "allow timeslice extension" would add. It's a mistake. I know IBM supports that model, but nobody sane calls it simple, and it has a very nasty go-slow case.
Yeah, yeah, even the simple abort model for TSX is broken and the cost of a flushed transaction is way too high, making it hard to do any generic transactional memory use.
You actually want something simpler and more limited, so that you can give performance guarantees and avoid deadlocks.
Maybe you would be happier if I compared rseq to "a fancy compare-multiple-and-exchange". Something that can still fail, but where the critical section is really small so that you can guarantee forward progress without complexity.
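As a rough illustration of what "compare-multiple-and-exchange" means semantically, the sketch below checks several locations against expected values and performs one store only if all of them match. The `check` struct and `check_all_then_store` are hypothetical names, and the function models the semantics only: nothing here is atomic, whereas with rseq the kernel would abort and restart the sequence if the thread got preempted in the middle.

```c
#include <assert.h>
#include <stddef.h>

/* One comparison: the word at `addr` must still hold `expected`. */
struct check {
    const long *addr;
    long expected;
};

/*
 * Semantic model only (an assumption for illustration, not the rseq ABI):
 * verify every check, then do exactly one store. Returns 1 if the store was
 * done, 0 if any comparison failed and nothing was written.
 */
static int check_all_then_store(const struct check *checks, size_t n,
                                long *target, long value)
{
    assert(target != NULL);
    for (size_t i = 0; i < n; i++) {
        if (*checks[i].addr != checks[i].expected)
            return 0;          /* fail early, no side effects */
    }
    *target = value;           /* the single store that commits */
    return 1;
}
```

Because a failed attempt has no side effects, the caller can simply reload the expected values and try again.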
> > And that, in turn, can be a big deal when you have 4 cores, but 4 million threads.
> > You don't want to have the memory overhead of per-thread allocations, when
> > all you really wanted was the cache advantages of per-cpu counters.
>
> Per-cpu or per "concurrent" thread?
It's a gray area. In practice, there is no difference.
In theory there might be, but nobody cares. Call it either.
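To put rough numbers on the memory-overhead point quoted above, here is a small sketch of cache-line-padded counter slots. The 64-byte line size and all the names are assumptions for illustration, and the add is a plain non-atomic update: a real per-CPU version would need rseq (or an atomic) to be safe against preemption.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cache-line size, not taken from the post */

/* One counter per slot, padded so that slots never share a cache line. */
struct padded_counter {
    alignas(CACHE_LINE) long value;
};

/* Bytes needed for `slots` counters (one per CPU, or one per thread). */
static size_t counters_footprint(size_t slots)
{
    return slots * sizeof(struct padded_counter);
}

/* Plain add to one slot; NOT preemption-safe without rseq or an atomic. */
static void counter_add(struct padded_counter *slots, size_t nslots,
                        size_t slot, long delta)
{
    assert(slot < nslots);
    slots[slot].value += delta;
}

/* Reading the total means summing every slot. */
static long counters_sum(const struct padded_counter *slots, size_t nslots)
{
    long total = 0;
    for (size_t i = 0; i < nslots; i++)
        total += slots[i].value;
    return total;
}
```

Under these assumptions, `counters_footprint(4)` is 256 bytes for four CPUs, while `counters_footprint(4000000)` is 256,000,000 bytes if every one of 4 million threads gets its own padded slot.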
> If hardware supported faster local atomics (which is problematic for x86 since the LOCK prefix is
> global in memory scope [and stronger than normal consistency]),
Atomics are expensive on pretty much any architecture.
x86 actually tends to be the best at them, despite the stronger memory model, so trying to make this about x86 is actually bad.
The best rseq speedup numbers actually came from 32-bit ARM, exactly because locking is much more expensive there (32-bit ARM doesn't even have execution-safe atomics), while x86 already does so well.
Note that on x86, you can do read-modify-write operations and it's atomic as long as you use a percpu area (so it's not atomic if other CPUs modify it, but it's atomic wrt interrupts or task switches etc). ARM64 added those too, but many architectures do not have them.
So your argument about memory consistency is pure and utter garbage. Weak memory consistency and the RISC model in general is actually worst for atomic performance, and you may have to have special "disable interrupts" just to get atomic behavior. x86, depending on what you do, can be a shining example and beacon of sanity.
> > It's a pretty limited use-case, and I don't expect normal users to really ever
> > see it. But it is designed to allow for things like per-cpu malloc libraries etc,
> > and a few other very specific situations where you can take advantage of it.
>
> (Normal application programmers presumably do not really ever
> see system calls but rather higher level abstractions.)
That's not even remotely true.
Things like read/write are very much normal app programmer interfaces. Pretty much all of POSIX is.
Once you get away from those standardized models, though, things change.
And rseq is not just way outside of POSIX, it's so low-level that you really wouldn't want to use it at an app level. So it's a low-level library thing, or perhaps a language abstraction that a compiler can use for per-cpu (as opposed to per-thread, which is easy) operations.
> I think transactional memory can be presented in a way that is useful enough and transparent/portable enough
> to have sufficiently significant
You get back to me when that is actually reality. Right now it's just a theory and not backed up by any kind of data what-so-ever.
As things are now, locking models actually perform better than transactional memory, with the strongest argument for transactional memory being "well, if your locks are bad, transactional memory can help you". So yeah, transactional memory can help for very specific locks that you have trouble making more fine-grained for whatever reason, but doesn't work for locking in general.
And entirely lockless ones perform better than either, but are also quite limited.
rseq is about the lockless models, not about the locked ones.
But lockless programming is hard, and usually mainly useful inside some internal library implementation, or in some very special code that makes scalability a big deal (eg a kernel or a database).
Linus