By: matthew (nobody.delete@this.example.com), August 19, 2018 11:51 am
Room: Moderated Discussions
Travis (travis.downs.delete@this.gmail.com) on August 19, 2018 11:36 am wrote:
> Maynard Handley (name99.delete@this.name99.org) on August 18, 2018 2:42 pm wrote:
>
> > compressed pages), or remote atomics (to speed up common atomic operations like ARC increments);
>
> We've been over this before, but remote atomics make no sense for the vast majority of
> code. A "remote atomic" necessarily has a latency of whatever shared thing is remote and
> doing the increment, so in the range of 10s of clocks for something like L3, or in the
> 100s of clocks if it happens in memory (or a bit less if in the memory controller).
>
> Existing atomic operations are almost totally local operations, that work
> at "core speed", and don't require any coherence traffic at all (except when
> another core wants the line, which applies to non-atomics as well).
>
> The overwhelming majority of the time when some line is operated on atomically, the
> last core to use that line was the same core. Making this case fast is really important.
>
> Remote atomics is like throwing out the best case which almost always happens in favor of always worst-case
> behavior, just to make that worst-case behavior a little better. Now every atomic access in your
> program has to go out over long-latency, power-sucking buses, and the core has to wait forever to
> get the result back (and it almost always does need the result, as in your ARC example).
>
> It just makes no sense as an implementation of existing atomics.
>
> Now there are some specific sharing patterns where it may work, but it would have to be a new instruction.
I think there's a good counter-example to your point, which is the Itanium McKinley implementation of fetchadd. If the line is in L1, the CPU does it. If it's not, the core tells the L2 to do it. Obviously no one's citing Itanium as a good example of much these days, but I never heard anyone complain about its implementation of atomics.
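To make that dispatch rule concrete, here is a toy Python model of it. This is only a sketch of the behaviour as I described it above, not of the actual McKinley hardware: the `Core` and `L2Cache` classes, the addresses, and the op counters are all made up for illustration, and there is no coherence protocol at all (a line simply lives either in the core's L1 or in the L2).

```python
from dataclasses import dataclass, field


@dataclass
class L2Cache:
    """Hypothetical shared L2 holding every line that is not in an L1."""
    memory: dict = field(default_factory=dict)  # addr -> value
    remote_ops: int = 0                         # fetchadds executed at the L2

    def fetchadd(self, addr, inc):
        # Executed at the L2: the line is NOT moved into the requester's L1.
        old = self.memory.get(addr, 0)
        self.memory[addr] = old + inc
        self.remote_ops += 1
        return old


@dataclass
class Core:
    l2: L2Cache
    l1: dict = field(default_factory=dict)      # addr -> value for lines held in L1
    local_ops: int = 0                          # fetchadds executed by the core

    def fetchadd(self, addr, inc):
        if addr in self.l1:
            # Line already in L1: the core does the add itself.
            old = self.l1[addr]
            self.l1[addr] = old + inc
            self.local_ops += 1
            return old
        # Line not in L1: hand the operation to the L2 instead of fetching the line.
        return self.l2.fetchadd(addr, inc)


l2 = L2Cache(memory={0x40: 10})
core = Core(l2=l2, l1={0x80: 1})
old_local = core.fetchadd(0x80, 1)    # hits L1, handled by the core
old_remote = core.fetchadd(0x40, 5)   # misses L1, handled by the L2
print(old_local, old_remote, core.local_ops, l2.remote_ops)
```

The point of the model is that the common case (line already local) never leaves the core, so the fast path Travis cares about is kept, and only the miss case is shipped out.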
Let's talk about something that might make sense. A lock is typically on a cacheline which contains other data that is also being operated on. If the lock is uncontended (either no CPU has the cacheline or another CPU has the cacheline in an unlocked state), you want to bring in the cacheline. If the lock is contended, the last thing you want is to steal the cacheline from that CPU; at the very least it wants to access the cacheline again to release the lock, and chances are it's operating on other data in that cacheline.
So _if_ we have a way to indicate to the CPU that "this op is a lock acquire", then it might make sense to have a mechanism in the inter-core protocol to have the L2 cache steal the cacheline from the CPU, perform the operation, then _depending on the result of the operation_ either return the cacheline or report back to the requester the result of the operation.
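A rough Python sketch of that decision, to show what "depending on the result" means. Everything here is hypothetical: the `Line` record, the core ids, and `remote_lock_acquire` are names invented for illustration, the lock is assumed to be a simple test-and-set word, and no real coherence protocol works exactly like this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Line:
    lock_held: bool = False       # the lock word inside the cacheline
    owner: Optional[int] = None   # id of the core whose cache holds the line, if any


def remote_lock_acquire(line: Line, requester: int) -> bool:
    """Hypothetical L2-side handling of an op tagged as 'lock acquire'."""
    previous_owner = line.owner
    line.owner = None                  # L2 pulls the line from whoever holds it
    if not line.lock_held:
        line.lock_held = True          # test-and-set succeeds at the L2
        line.owner = requester         # uncontended: the line moves to the requester
        return True
    line.owner = previous_owner        # contended: hand the line back to the holder
    return False                       # the requester only gets the result


def release(line: Line, holder: int) -> None:
    # The holder still has the line, so releasing the lock is a local store.
    assert line.owner == holder
    line.lock_held = False


line = Line(lock_held=False, owner=2)             # core 2 has the line, lock is free
got_a = remote_lock_acquire(line, requester=0)    # uncontended: core 0 gets lock and line
got_b = remote_lock_acquire(line, requester=1)    # contended: core 0 keeps the line
release(line, holder=0)
got_c = remote_lock_acquire(line, requester=1)    # free again: core 1 gets lock and line
```

Note that in the contended case the holder (core 0) never loses the line for long, which is the whole reason for doing the operation remotely.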
It's fiendishly complicated. It'd almost be better to have a way to transmit a series of operations and have the remote CPU execute them, because there isn't just one way to acquire a lock.
It's possible there are cases like "update global counter" that might be simpler to support than "acquire lock without troubling remote CPU too much".
Maynard's ideas are, as usual, poorly thought through.