By: Travis (travis.downs.delete@this.gmail.com), August 20, 2018 8:57 am
Room: Moderated Discussions
matthew (nobody.delete@this.example.com) on August 19, 2018 7:58 pm wrote:
> Travis (travis.downs.delete@this.gmail.com) on August 19, 2018 2:32 pm wrote:
> > Then you also have to understand what you are saving. The current mechanism is that the waiting
> > CPU sends a "message" (a probe) which is snooped by the CPU holding the lock, which invalidates its
> > copy of the line (without really slowing down), and then the line goes back to the waiting CPU,
> > and after that it can perform local operations on it (spinning) until the holding CPU finally
> > wants to exit the lock, at which point it needs to steal the line back (and then finally the
> > waiting CPU takes the line back one more time and this time probably gets the lock).
>
> Ah, but the bit you missed is that the cacheline with the lock on it probably also contains data being
> operated on by the lock owner. So you might well see a flood of coherency traffic as the cacheline
> transitions from exclusive -> shared -> exclusive -> shared due to all the other CPUs spinning on the
> cacheline while the owning CPU modifies the cacheline.
Yup, I saw that, but that seems like mis-designed software in that case. In the case where your critical section is very short, sure, you want to co-locate the data and the lock so you can get the line, take the lock, do your work, and unlock, all very quickly and likely before any other core steals the line, except under extreme contention.
In your scenario, though, the lock is held for a while while the lock holder does work, and there is enough contention that other threads are spinning on the lock. The best options are to reduce the contention or the holding period, but if you can't, then in this scenario the lock and the data should go on separate lines: let the holding core do its work!
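Just to make that layout concrete, something like this (a minimal C sketch; the struct and field names are made up for illustration, and a 64-byte line size is assumed):

    #include <stdalign.h>
    #include <stdatomic.h>

    /* For a long-held, contended lock: keep the lock word and the protected
     * data on separate 64-byte cache lines, so waiters spinning on the lock
     * don't keep stealing the line the holder is actually modifying. */
    struct guarded {
        alignas(64) atomic_int lock;  /* line 0: spun on by the waiters     */
        alignas(64) long data[6];     /* line 1: touched only by the holder */
    };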
> I believe Intel CPUs (at least) have a hardware
> optimisation which makes locks more unfair to reduce this traffic, but I know no details.
Well lock "fairness" is mostly a consequence of software design, but at a lower level the hardware arbitration for contented cache lines plays a role too: but my impression is that this is largely fair: at least in the sense that line owners always given up their lines essentially immediately when probes occur, and there is also a mechanism to ensure that under contention from multiple cores the line goes around fairly evenly between cores with no starvation (this applies at the line-transfer level only: of course in software you can still experience starvation depending on exactly what you need to do to make progress).
> > Consider sending instructions to a remote CPU to execute. This has the downside of slowing down the very
> > CPU that is holding the lock, which is the one agent in the system you don't want to slow down! Also,
> > if getting the lock fails, you haven't brought the line back locally, so every time you want to check the
> > lock again, you need to interrupt the lock-holding CPU. Add more cores and the problem gets worse.
> It depends what your instruction sequence is to acquire the lock. Linux is just "lock
> cmpxchg" for a spinlock acquire. That's actually fairly easy to handle ... the remote
> CPU knows whether the cmpxchg succeeded or not and can send the cacheline across:
> ...
> So the requesting CPU only has to send across the contents of %edx and the lock insn.
> The remote CPU only has to execute the cmpxchg (and that doesn't need much of an ALU!)
Yes, I know that a spinlock is very simple in software, but I wasn't saying the lock itself was complex. I was saying the process of sending any instructions to another core to execute on behalf of the sender is complex! Consider that the other core, with its giant pipeline full of 100+ instructions in flight and store buffers full of unretired, retired, and senior stores, now has to accommodate an external request to compute some value atomically, playing by all the memory ordering rules. Consider even the complexity of dealing with the fact that the cores might have totally different page tables (e.g., different CR3 values).
The main problem, though, is that it doesn't scale well: to "spin", every waiting core needs to continually send these remote messages every time it wants to check the lock. It turns the common operation from a local one into a remote one.
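That local spin is the whole appeal of the current scheme: after the first failed attempt, the waiter spins read-only on its own cached copy of the line. A minimal test-and-test-and-set sketch in C11 atomics (illustrative only, not the actual Linux implementation; the compare-exchange is what compiles down to "lock cmpxchg" on x86):

    #include <stdatomic.h>
    #include <immintrin.h>  /* _mm_pause */

    static void spin_lock(atomic_int *lock)
    {
        for (;;) {
            int expected = 0;
            if (atomic_compare_exchange_strong_explicit(
                    lock, &expected, 1,
                    memory_order_acquire, memory_order_relaxed))
                return;  /* cmpxchg succeeded: we own the lock */
            /* Failed: spin read-only on our cached copy of the line.
             * No coherency traffic until the holder's store to the
             * lock word invalidates the line. */
            while (atomic_load_explicit(lock, memory_order_relaxed))
                _mm_pause();
        }
    }

    static void spin_unlock(atomic_int *lock)
    {
        atomic_store_explicit(lock, 0, memory_order_release);
    }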
> But if we're going all-out crazy redesign the world, maybe the remote CPU doesn't execute the lock cmpxchg.
> Maybe we just give the CPU designers one 32-bit word in our data structures and we just have a "spinlock" instruction
> which ends up being implemented as "the remote CPU does not return the cacheline until you won the lock", and
> the CPU can use the contents of that word for whatever fair queueing algorithm it wants to use.
Yeah, you do kind of want a user-mode MONITOR/MWAIT to effect spin-free lock handoff. Fair handoff-based locks are pretty unpopular, though. Maybe you could abuse XBEGIN/XABORT for it, as a kind of message-passing/cheap user-mode IPI.
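Roughly what I have in mind, as an entirely hypothetical sketch: umonitor()/umwait() below stand in for a user-mode MONITOR/MWAIT that doesn't exist (the real instructions are ring-0 only), and the ticket scheme is just one way to get fair handoff:

    #include <stdatomic.h>

    /* Hypothetical user-mode primitives, stubbed as no-ops so the sketch
     * compiles; with real hardware support, umwait() would sleep until a
     * write touched the monitored line. */
    static inline void umonitor(const void *addr) { (void)addr; }
    static inline void umwait(void) { }

    /* Wait until the holder hands the lock to our ticket, without
     * generating any coherency traffic while we wait. */
    static void handoff_wait(atomic_int *now_serving, int my_ticket)
    {
        while (atomic_load_explicit(now_serving, memory_order_acquire)
               != my_ticket) {
            umonitor(now_serving);  /* arm monitor on the lock's line */
            /* Re-check to close the race between the load and the arm. */
            if (atomic_load_explicit(now_serving, memory_order_acquire)
                == my_ticket)
                break;
            umwait();  /* sleep until the handoff store wakes us */
        }
    }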