By: matthew (nobody.delete@this.example.com), August 19, 2018 6:58 pm
Room: Moderated Discussions
Travis (travis.downs.delete@this.gmail.com) on August 19, 2018 2:32 pm wrote:
> matthew (nobody.delete@this.example.com) on August 19, 2018 12:51 pm wrote:
> > I think there's a good counter-example to your point, which is the Itanium McKinley implementation
> > of fetchadd. If the line is in L1, the CPU does it. If it's not, the core tells the
> > L2 to do it. Obviously no-one's citing Itanium as a good example of much these days,
> > but I never heard anyone complain about its implementation of atomics.
>
> As far as I can tell McKinley only ever had 1 core, right? So was this some kind
> of optimization for multiple socket machines? The line still had to get back to
> the core, since this is fetch and add, so you always suffer L2 latency?
Yes, the only dual-core-per-socket Itanium was the Hondo hack, which glued two Madisons into a single socket. They didn't even share L3, as far as I recall. But you have to remember that back in 2002, HP was all about 2-64 socket machines. They weren't optimising for workstations.
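For reference, the operation in question is just an atomic fetch-and-add; a minimal C sketch using the GCC/Clang __atomic built-ins (illustrative only, not ia64-specific code) looks like:

#include <stdint.h>

static uint32_t counter;   /* shared counter, e.g. a refcount */

uint32_t bump(void)
{
    /* Atomic fetch-and-add with acquire semantics.  On ia64 the compiler
     * can emit fetchadd4.acq for the small constant increments that
     * instruction supports; elsewhere it becomes lock xadd, ldadd, etc. */
    return __atomic_fetch_add(&counter, 1, __ATOMIC_ACQUIRE);
}

On McKinley, as described above, that single operation could be satisfied either in the core (if the line was already in L1) or by the L2 on the core's behalf.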
> > Let's talk about something that might make sense. A lock is typically on a cacheline which contains other
> > data that is also being operated on. If the lock is uncontended
> > (either no CPU has the cacheline or another
> > CPU has the cacheline in an unlocked state), you want to bring in the cacheline. If the lock is contended,
> > the last thing you want is to steal the cacheline from that CPU; at the very least it wants to access the
> > cacheline again to release the lock, and chances are it's operating on other data in that cacheline.
>
> Definitely, that's one way I can see this working: either with new instructions/hints for the cases the
> developer knows benefit from remote atomics, or some kind of prediction/dynamic behavior. The locking one
> is a good example: right now you have to steal the cache line only to find out that it's locked, maybe
> you could have a type of atomic that leaves the line in place if it is in the locked state, since stealing
> it won't be useful anyways. Still, that's not really "remote atomics" because when you do get the lock
> you want to bring it into your cache so that unlock and subsequent lock/unlock pairs are really fast. It's
> more like an optimization of the RFO protocol to embed knowledge of the locking into the probe.
Yes, that's the general direction I'm thinking in.
> Stuff like that doesn't seem too popular though: there is a lot of probably low-hanging fruit to
> really optimize locking: for example, with some cooperation from the OS scheduler you could avoid
> switching out threads that are in the middle of short, contended critical sections, since that ends
> up causing a big caravan as all the other threads slam up against the lock before the owning thread
> is scheduled. It seems "easy" to do something like this in a fairly cheap way (e.g., set a flag
> in the thread control block or something), maybe on an opt-in basis, but it hasn't happened.
Solaris had the ability for a thread to tell the kernel it was holding a mutex and that it shouldn't be preempted until it had dropped the lock. There have been several attempts to add something like that to Linux, but none have succeeded yet. Nothing with hardware assists either.
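For the curious, the Solaris mechanism is the schedctl preemption-control interface. From memory of the documented libc API (so treat the details as a sketch rather than gospel), usage is roughly:

#include <schedctl.h>

static __thread schedctl_t *sc;       /* one preemption-control block per thread */

void thread_setup(void)
{
    sc = schedctl_init();             /* map the shared kernel/user page */
}

void short_critical_section(void)
{
    schedctl_start(sc);               /* hint: please don't preempt me right now */
    /* ... acquire the mutex, do the brief work, release the mutex ... */
    schedctl_stop(sc);                /* hint off; normal preemption resumes */
}

It's only a hint; the kernel still bounds how long preemption can be deferred, so a thread can't abuse it to hog the CPU.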
> > It's fiendishly complicated. It'd almost be better to have a way to transmit a series of operations
> > and have the remote CPU execute them, because there isn't just one way to acquire a lock.
>
> Sure, making it general is tough.
>
> Then you also have to understand what you are saving. The current mechanism is that the waiting
> CPU sends a "message" (a probe) which is snooped by the CPU holding the lock, which invalidates its
> copy of the line (without really slowing down), and then the line goes back to the waiting CPU,
> and after that it can perform local operations on it (spinning) until the holding CPU finally
> wants to exit the lock, at which point it needs to steal the line back (and then finally the
> waiting CPU takes the line back one more time and this time probably gets the lock).
Ah, but the bit you missed is that the cacheline with the lock on it probably also contains data being operated on by the lock owner. So you might well see a flood of coherency traffic as the cacheline transitions exclusive -> shared -> exclusive -> shared, with all the other CPUs spinning on the cacheline while the owning CPU modifies it. I believe Intel CPUs (at least) have a hardware optimisation which makes locks more unfair in order to reduce this traffic, but I don't know the details.
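To make the layout concrete, here's a hedged C sketch of the kind of structure that causes that ping-pong, plus the padded alternative (names invented, 64-byte cachelines assumed):

#include <stdalign.h>

/* Lock and hot data on the same 64-byte line: every probe from a waiter
 * spinning on 'lock' yanks the line away from the owner, which is busy
 * modifying 'count' on that very same line. */
struct hot_shared_line {
    unsigned int lock;        /* 0 = free, 1 = held */
    unsigned long count;      /* updated while the lock is held */
};

/* Putting the lock on its own line avoids the ping-pong while the owner
 * works, at the price of an extra line transfer on the uncontended path,
 * which is exactly the trade-off being discussed. */
struct hot_split_lines {
    alignas(64) unsigned int lock;
    alignas(64) unsigned long count;
};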
> Consider sending instructions to a remote CPU to execute. This has the downside of slowing down the very
> CPU that is holding the lock, which is the one agent in the system you don't want to slow down! Also,
> if getting the lock fails, you haven't brought the line back locally, so every time you want to check the
> lock again, you need to interrupt the lock-holding CPU. Add more cores and the problem gets worse.
It depends on what instruction sequence you use to acquire the lock. Linux is just "lock cmpxchg" for a spinlock acquire. That's actually fairly easy to handle ... the remote CPU knows whether the cmpxchg succeeded or not and can send the cacheline across:
0000000000000110 :
110: 31 c0 xor %eax,%eax
112: ba 01 00 00 00 mov $0x1,%edx
117: f0 0f b1 17 lock cmpxchg %edx,(%rdi)
11b: 85 c0 test %eax,%eax
11d: 75 01 jne 120
11f: c3 retq
So the requesting CPU only has to send across the contents of %edx and the lock insn. The remote CPU only has to execute the cmpxchg (and that doesn't need much of an ALU!)
Now, that's the fast path. The slow path (i.e. where the jne leads) does this:
5: ba 01 00 00 00 mov $0x1,%edx
a: 8b 07 mov (%rdi),%eax
c: 85 c0 test %eax,%eax
e: 75 09 jne 19
10: f0 0f b1 17 lock cmpxchg %edx,(%rdi)
14: 85 c0 test %eax,%eax
16: 75 f2 jne a
18: c3 retq
19: f3 90 pause
1b: eb ed jmp a
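In C terms that pair of paths is the classic test-and-test-and-set pattern; roughly (a sketch using GCC built-ins and an x86 pause, not the kernel's actual source, which has more going on):

typedef struct { unsigned int val; } spinlock_t;    /* 0 = free, 1 = held */

static inline void cpu_relax(void)
{
    __asm__ volatile("pause" ::: "memory");         /* the f3 90 above */
}

static inline int spin_trylock(spinlock_t *l)       /* the fast path */
{
    unsigned int expected = 0;
    /* lock cmpxchg: store 1 only if the word was 0, report success */
    return __atomic_compare_exchange_n(&l->val, &expected, 1, 0,
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

static inline void spin_lock(spinlock_t *l)         /* the slow path */
{
    while (!spin_trylock(l)) {
        /* spin read-only on the local copy of the line until it looks free */
        while (__atomic_load_n(&l->val, __ATOMIC_RELAXED) != 0)
            cpu_relax();
    }
}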
But if we're going all-out crazy and redesigning the world, maybe the remote CPU doesn't execute the lock cmpxchg at all. Maybe we just give the CPU designers one 32-bit word in our data structures and have a "spinlock" instruction which ends up being implemented as "the remote CPU does not return the cacheline until you've won the lock", and the CPU can use the contents of that word for whatever fair queueing algorithm it wants to use.