By: Travis (travis.downs.delete@this.gmail.com), August 19, 2018 1:32 pm
Room: Moderated Discussions
matthew (nobody.delete@this.example.com) on August 19, 2018 12:51 pm wrote:
> I think there's a good counter-example to your point, which is the Itanium McKinley implementation
> of fetchadd. If the line is in L1, the CPU does it. If it's not, the core tells the
> L2 to do it. Obviously no-one's citing Itanium as a good example of much these days,
> but I never heard anyone complain about its implementation of atomics.
As far as I can tell McKinley only ever had one core, right? So was this some kind of optimization for multi-socket machines? The result still had to get back to the core, since this is a fetch-and-add, so you always suffer the L2 latency?
Here the L2 is still private, so it's kind of still a local operation. I guess the idea is maybe that atomic operations are a signal that a line might be probed by another socket, so leaving the current value in the L2 reduces latency if that happens.
Keep in mind that this is also a quantitative tradeoff: atomic operations used to take 100+ cycles, so back then doing something in the L2 might have been essentially free given the overall high cost of the operation.
Atomics are now down to maybe a dozen-ish cycles on higher-quality implementations and are non-blocking (i.e., the core can order other work around them), and the 12-20 cycle latency is really a worst case where you are doing back-to-back operations, which isn't all that common, so the real impact is often less.
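To make the "back-to-back is the worst case" point concrete, here's a rough sketch (plain C11 atomics, not a rigorous benchmark): the first loop pays the full atomic latency every iteration, the second mixes the atomic with independent work so much of that latency can be hidden.

```c
#include <stdatomic.h>
#include <stdio.h>

int main(void) {
    atomic_long counter = 0;
    long local = 0;

    /* Worst case: back-to-back atomic RMWs with nothing in between.
     * Each one waits for the previous one, so you pay roughly the full
     * atomic latency per iteration. */
    for (int i = 0; i < 1000000; i++)
        atomic_fetch_add(&counter, 1);

    /* More typical: one atomic mixed in with ordinary independent work.
     * Since the atomic doesn't block execution of the surrounding
     * instructions, much of its latency overlaps with that work and the
     * visible cost is smaller. */
    for (int i = 0; i < 1000000; i++) {
        atomic_fetch_add(&counter, 1);
        for (int j = 0; j < 8; j++)
            local = local * 3 + j;   /* independent ALU work */
    }

    printf("%ld %ld\n", atomic_load(&counter), local);
    return 0;
}
```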
>
> Let's talk about something that might make sense. A lock is typically on a cacheline which contains other
> data that is also being operated on. If the lock is uncontended (either no CPU has the cacheline or another
> CPU has the cacheline in an unlocked state), you want to bring in the cacheline. If the lock is contended,
> the last thing you want is to steal the cacheline from that CPU; at the very least it wants to access the
> cacheline again to release the lock, and chances are it's operating on other data in that cacheline.
Definitely, that's one way I can see this working: either with new instructions/hints for the cases where the developer knows remote atomics will help, or some kind of prediction/dynamic behavior. The locking one is a good example: right now you have to steal the cache line only to find out that it's locked. Maybe you could have a type of atomic that leaves the line in place if it's in the locked state, since stealing it won't be useful anyway. Still, that's not really "remote atomics", because when you do get the lock you want to bring the line into your cache so that the unlock and subsequent lock/unlock pairs are really fast. It's more like an optimization of the RFO protocol that embeds knowledge of locking into the probe.
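As a rough software-side illustration of that point (my sketch, nothing from any manual): a "blind" acquire attempt always requests the line for ownership even when the lock turns out to be held, while a check-first attempt only needs a shared copy on failure and leaves ownership with the holder. The hypothetical hardware version above would fold that check into the coherence probe itself.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int locked; } lock_t;    /* 0 = free, 1 = held */

/* Always performs the atomic RMW: the line is stolen even on failure. */
static bool try_acquire_blind(lock_t *l) {
    return atomic_exchange(&l->locked, 1) == 0;
}

/* Checks with a plain load first: a failed attempt doesn't take ownership. */
static bool try_acquire_test_first(lock_t *l) {
    if (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
        return false;                             /* looks locked: don't RFO */
    return atomic_exchange(&l->locked, 1) == 0;   /* looks free: go for it */
}

int main(void) {
    lock_t l = { 0 };
    bool a = try_acquire_test_first(&l);   /* succeeds: lock was free */
    bool b = try_acquire_blind(&l);        /* fails: already held */
    return (a && !b) ? 0 : 1;
}
```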
Stuff like that doesn't seem too popular though: there is a lot of probably low-hanging fruit for really optimizing locking. For example, with some cooperation from the OS scheduler you could avoid switching out threads that are in the middle of short, contended critical sections, since that ends up causing a big convoy as all the other threads slam up against the lock before the owning thread is scheduled again. It seems "easy" to do something like this in a fairly cheap way (e.g., set a flag in the thread control block or something), maybe on an opt-in basis, but it hasn't happened.
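Something like the sketch below is what I have in mind. To be clear, dont_preempt_hint_begin()/end() are entirely made up; they stand in for "set/clear a per-thread flag that asks the scheduler, on a best-effort basis, not to deschedule this thread for the next few microseconds".

```c
#include <pthread.h>

/* Hypothetical stubs: in a real system these would set/clear a flag in the
 * thread control block that the scheduler checks before preempting. */
static inline void dont_preempt_hint_begin(void) { /* no-op placeholder */ }
static inline void dont_preempt_hint_end(void)   { /* no-op placeholder */ }

void increment_shared(pthread_mutex_t *m, long *counter) {
    dont_preempt_hint_begin();   /* ask: don't deschedule me for a moment */
    pthread_mutex_lock(m);
    (*counter)++;                /* short, contended critical section */
    pthread_mutex_unlock(m);
    dont_preempt_hint_end();     /* normal scheduling resumes */
}
```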
The whole spin-locking idea could probably use some hardware optimization too, like a userland MONITOR/MWAIT type of thing which could be used efficiently by the unlocking thread to signal a waiting thread without involving the OS, and without needing the cache line to be in shared state across all the waiters and then invalidated on all of them, etc.
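Roughly, today a waiter either spins (with a pause hint) or falls back to the kernel; a user-level monitor/mwait-style primitive would let it block right there without either. In the sketch below, user_monitor()/user_mwait() are hypothetical stand-ins for such a primitive; everything else is plain C11 plus the x86 pause intrinsic.

```c
#include <stdatomic.h>
#include <immintrin.h>   /* _mm_pause */

/* Hypothetical stand-ins for a user-level MONITOR/MWAIT-style primitive:
 * arm a wait on an address, then sleep until that line is written. */
void user_monitor(const void *addr);
void user_mwait(void);

void wait_for_unlock(atomic_int *lock_word) {
    /* Today: spin with pause, burning power and keeping the line in shared
     * state across every waiter until the unlock store invalidates them all. */
    while (atomic_load_explicit(lock_word, memory_order_acquire) != 0)
        _mm_pause();

    /* With a user-level monitor/mwait, the loop body could instead be:
     *     user_monitor(lock_word);
     *     if (atomic_load_explicit(lock_word, memory_order_acquire) != 0)
     *         user_mwait();
     * letting the unlocking thread wake waiters with a single store,
     * no OS call involved. */
}
```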
> So _if_ we have a way to indicate to the CPU that "this op is a lock acquire", then it might
> make sense to have a mechanism in the inter-core protocol to have the L2 cache steal the cacheline
> from the CPU, perform the operation, then _depending on the result of the operation_ either
> return the cacheline or report back to the requester the result of the operation.
Yeah, agreed.
> It's fiendishly complicated. It'd almost be better to have a way to transmit a series of operations
> and have the remote CPU execute them, because there isn't just one way to acquire a lock.
Sure, making it general is tough.
Then you also have to understand what you are saving. The current mechanism is that the waiting CPU sends a "message" (a probe) which is snooped by the CPU holding the lock, which invalidates its copy of the line (without really slowing down), and then the line goes to the waiting CPU. After that it can perform local operations on it (spinning) until the holding CPU finally wants to exit the lock, at which point it steals the line back (and then finally the waiting CPU takes the line one more time, and this time probably gets the lock).
So it's only a handful of messages, and all the loads done while spinning, except for the initial and final ones, occur locally and don't incur any coherence traffic. Similarly, any number of other threads can participate in spinning, and after the initial transfer they also spin on their shared copy of the line locally. So it's not all that terrible.
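To spell out where the traffic actually is, here's a bare-bones test-and-test-and-set spinlock (C11, my sketch) with the coherence events from the description above marked:

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;   /* 0 = free, 1 = held */

void spin_lock(spinlock_t *l) {
    for (;;) {
        /* Attempt the acquire: this is the probe/RFO that pulls the line
         * away from whoever owns it (one message plus one line transfer). */
        if (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0)
            return;

        /* Spin on a plain load: after the first miss the line sits in this
         * core's cache in shared state, so these iterations are purely
         * local -- no coherence traffic until the holder's unlock store
         * invalidates the copy. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
            ;   /* ideally with a pause/yield hint here */
    }
}

void spin_unlock(spinlock_t *l) {
    /* The unlock store is what steals the line back from the waiters'
     * caches; the next successful exchange then moves it once more to
     * the new owner. */
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```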
Consider sending instructions to a remote CPU to execute. This has the downside of slowing down the very CPU that is holding the lock, which is the one agent in the system you don't want to slow down! Also, if getting the lock fails, you haven't brought the line back locally, so every time you want to check the lock again, you need to interrupt the lock-holding CPU. Add more cores and the problem gets worse.
So the way MESI (and enhancements) works isn't actually all that terrible even for locking scenarios, and it's no surprise that pretty much everyone uses it.
> Maynard's ideas are, as usual, poorly thought through.
I mean, what really pissed me off is he's like "Oh, this is all Intel and Linux old-style, never-innovate, stuck-in-the-mud type stuff", of course ignoring the fact that everyone implements it like this, and that his beloved Apple makes great CPUs mostly by leveraging all of the same really good, tried-and-tested ideas.