By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), August 21, 2018 8:09 am
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on August 20, 2018 4:29 pm wrote:
> matthew (nobody.delete@this.example.com) on August 19, 2018 7:58 pm wrote:
> >
> > Solaris had the ability for a thread to tell the kernel it was holding a mutex and that it shouldn't
> > be preempted until it had dropped the lock. There have been several attempts to add something
> > like that to Linux, but none have succeeded yet. Nothing with hardware assists either.
>
> Linux now (merged into the latest released kernel version, 4.18) actually has
> what could be seen as the reverse of that: "rseq" aka restartable sequences.
>
> It doesn't disable preemption (which is crazy and all kinds of stupid), but it does
> allow user space to see if it has been preempted, and mark certain sequences to
> be done atomically. And if preemption happens, the sequence gets aborted.
Since threads have phases that benefit from not being significantly interrupted, I think there would be value in allowing a thread to express that a phase will extend beyond the normally allotted time slice (and in allowing it to end a time slice early while still informing the scheduler that more work is available, i.e., that the current point is a convenient stopping place).
Critical sections guarded by locks are especially important phases because they hinder forward progress by other threads. Furthermore, such sections tend to be short, so requesting an additional sub-millisecond of execution time would not seem especially disruptive to responsiveness.
(Rather than skipping a preemption, a thread might query how many cycles remain in its execution allocation before entering the critical section and, if the allocation is insufficient for the desired likelihood of completing the phase, yield with a request for continuation, possibly giving a length in the request. Hardware support seems likely to be helpful here, at least some means of cheaply determining approximately how much execution time a thread has remaining; hardware support might also make such a mechanism less useful for timing side channels.)
Phases involving cache (or branch predictor) warm-up would typically be longer and have less impact on performance, but phase information might still be useful.
This has obvious similarities to core/cache/memory node affinity. The request is more a hint than a directive, but some workloads might value certain kinds of affinity so highly that they are not worth running without it. (This also relates to real time requirements.)
(One could also argue for communicating the degree of tolerance of descheduling. L3 warm-up would be more tolerant of moderate-duration interference from other threads than L1 warm-up. Some locks would tolerate more delay in release than others; some delay tolerance is dynamic and some static, so communication of intent/expectation may be a better hint than a request for a specific behavior. A market for resources and warranties (a warranty being like an inverse bid: failure to get the resource as contracted earns credit potentially much greater than the purchase price) could handle a variety of resources and requirements.)
> So you can think of rseq as kind of like the OS equivalent of transactional memory, but instead
> of the transactional sequence being aborted on a cache conflict, it gets aborted on preemption.
Except that the abort happens only after the preemption completes. If a thread acquires a lock and starts work in a critical section, it cannot release the lock (and undo the no-longer-appropriate work) while it is descheduled, even when another thread is waiting on the lock.
(Some of the transactional memory proposals suggest NAKing conflicting remote requests or using versioned memory to give a transaction a larger window in which to complete, which is similar to extending the time slice.)
> That allows you to do certain per-cpu things in user space (as opposed to per-thread).
>
> And that, in turn, can be a big deal when you have 4 cores, but 4 million threads.
> You don't want to have the memory overhead of per-thread allocations, when
> all you really wanted was the cache advantages of per-cpu counters.
Per-cpu or per "concurrent" thread? (Hardware multithreading does not have to be presented as virtual processors.) I am assuming the latter (for atomicity guarantees). For in-program local atomicity, atomicity failure on interruption may be excessively conservative. Tracking by "atomicity thread group" might not be excessively complex (though it might not have much advantage since interrupts are relatively rare).
If hardware supported faster local atomics (which is problematic for x86 since the LOCK prefix is global in memory scope [and stronger than normal consistency]), cache affinity might have further application. I doubt there would be much use for a non-concurrency check at L2 cache level (e.g., if eight software thread groups only interfere within a group and as long as no two hardware threads within a group are scheduled concurrently, "local" accesses are "interrupt atomic"), but such might be a path worth some thought to discover unexpected opportunities in more ordinary uses.
> It's a pretty limited use-case, and I don't expect normal users to really ever
> see it. But it is designed to allow for things like per-cpu malloc libraries etc,
> and a few other very specific situations where you can take advantage of it.
(Normal application programmers presumably do not really ever see system calls but rather higher level abstractions.)
> We'll see if people end up taking advantage of it. The downside with a lot of clever interfaces is that because
> they are non-standard, you really don't see people using them unless there is a big win or unless you can transparently
> hide them in a library with absolutely zero downside from the portable standard approach.
>
> Which is basically what seems to have killed transactional memory. The library approach (HLE) ends up performing
> horribly in many real-life situations, so it's not possible to use as a direct transparent replacement,
> and the full transactional model is so non-portable that it's not worth spending effort on.
I think transactional memory can be presented in a way that is useful enough and transparent/portable enough to achieve significant (worthwhile) and broad (debugging/optimizing) adoption, but I have not given this the extended consideration required to work out a reasonable interface. Some changes would probably have to be made to programming interfaces (though some intention can sometimes be discovered by the compiler from "ordinary" source code); ideally such changes would make programming easier and less dangerous.
For HLE to provide behavior similar to general transactional memory, it seems that hardware would have to lie about the returned state of a lock. I.e., a lock that is not actually held (which other threads are "holding" transactionally) would present a locked value, so the thread could choose to spin, pause, or do other work. Intel's HLE makes this more difficult; the hardware can observe the post-lock-acquire test and cause it to fail, but if the lock contains other information the hardware could not easily fake that information.
Even if HLE never performed significantly worse than a lock and generally performed better, adoption and optimization would be artificially constrained. Part of the point of Intel's HLE is to encourage portability (code will work on any x86) and low-effort adoption. This is a good initial goal, but one also wants an interface and implementation/expectation that encourages optimization. There is not even a platform-level "guarantee" of behavior, so optimization is constrained.
(This may be somewhat similar to the indexing (and associativity and replacement policies) of caches. Modulo-power-of-two indexing encourages certain coding practices which may be unnecessary or even harmful on hardware with different conflict behavior. If skewed associativity, quasi-prime modulo indexing, or another conflict-changing cache implementation were provided, sometimes without even a firm indication of persistence within a platform, changing software would only be justifiable for short-term gains.)
> We'll see if rseq can do better.
It seems it already has significant adoption for memory allocation.