By: Maynard Handley (name99.delete@this.name99.org), August 19, 2018 5:24 pm
Room: Moderated Discussions
Travis (travis.downs.delete@this.gmail.com) on August 19, 2018 2:07 pm wrote:
> Maynard Handley (name99.delete@this.name99.org) on August 19, 2018 12:22 pm wrote:
> >
> > It depends on HOW your system is organized. Consider the current iPhone, which appears (as far as I can
> > tell) to have a low latency (12..15 cycle or so) large L2 that covers the large+small CPUs (but not the
> > GPU or IO). For that sort of system "near remote" atomics seem like a pretty good win for many purposes.
>
> Why are they a win? Even if you can do it all in L2, you are probably doubling the cost or worse.
>
> > Half of ARC references are remote only (increments); the other half depend on the
>
> You chose a convenient naming here: "remote only" - but you really mean "the result is not examined", right?
> Why do it remotely when you could do it locally which is much faster? After you increment, you are probably
> decrementing most of the time pretty soon, so even more reason to bring it into the cache.
>
> > result (decrements, testing against zero). I don't know, given GCD usage these days,
> > how much transitioning there is between CPUs (including large to small CPUs).
>
> Probably less than one out of a million decrements are done on a thread (not CPU) different from the
> one that incremented it. Most reference counts probably never exceed two and are never shared.
> That's just the overwhelming use of references/pointers/smart-pointers in every language.
>
> Of course sometimes you migrate CPUs, but then you are talking about the huge cost of moving over all
> the core-private caches (perhaps just L1 in the A* cores case), so having a few lines living in L2 helps
> pretty much not at all (logically, that "migrated cores" argument leads to getting rid of the L1 entirely
> and keeping everything in L2 just because of core migration - an obvious non-starter).
>
> Also, the reference count is presumably tightly packed with the object data itself, or other
> interesting, unshared data in the same cache line: so it's not like you can just point at
> the reference count and force it to live in L2, since it applies to the entire line.
>
> Note the level that sharing occurs at doesn't really matter much for this argument: closer to the core
> sharing makes "remote" atomics faster, but it also means the downside of the normal local atomics when
> actual sharing occurs is much less, so the gain for remote in that scenario is correspondingly reduced.
>
> > I also don't know how much quasi-common "occasionally" updated
> > material (like comm-pages, and OS data structures)
> > exists that's best thought of and treated as living in common L2 rather than on a particular CPU.
>
> Practically nothing. Again you are begging the argument by implicitly linking "quasi-common occasionally
> updated material" with "best thought of and treated as living in common L2". That's the problem.
> Common, rarely updated data is best living right next to the core(s) that need it, just like
> all other data: duplicated in many L1s if needed. You don't screw the common case (loads of the
> shared lines directly from L1), just to speed up the uncommon update case.
You may be correct; I am offering these as hypotheticals, not as settled claims.
However, I am leery of the sort of arguments you are giving, for the reasons I gave. I am quite willing to accept that many readers here are extreme experts on the best way of doing things FOR SPECIFIC TARGETS TODAY. That's rather different from knowing the best way of doing different things, given different constraints.
To give just one example (different from what we are discussing, but making the same point): if you were to describe a language like Objective-C to most of the denizens of this board, they would call it a horrible idea along multiple dimensions, with way too much time wasted on run-time resolution of indirect calls that "should" be resolved at compile time. By some lights that's a good argument; by other lights it's a terrible argument, because it's optimizing for the wrong thing.
What WOULD I consider convincing data that remote atomics are a silly idea?
One possibility would be a design document from ARM describing why they added them, the problems they want to solve, and the problems they do NOT see them solving. Another would be internal discussions from Apple about what's expensive and what's not in the current OS+runtime. (Such discussions are rare but do occasionally surface; there was one late last year talking about what works well vs badly in GCD, and where that system should evolve to.) Another would be academic papers about systems (real or simulated) that use remote atomics, showing where they are vs are not a good idea.
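For anyone following along, the increment/decrement asymmetry we keep arguing about can be sketched roughly like this. This is a minimal illustration, not ARC's actual implementation: `RefCounted`, `retain` and `release` are names I made up, and whether a given compiler lowers the increment to an ARMv8.1 LSE instruction such as STADD (or LDADD with the result discarded) depends on the compiler and target flags.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical intrusive reference count, for illustration only.
struct RefCounted {
    std::atomic<std::uint32_t> refs{1};  // starts owned by its creator
};

// Increment: the old value is discarded, so no later instruction depends on
// the result. This is the case where hardware is free, in principle, to
// perform the update wherever the line currently lives.
inline void retain(RefCounted& obj) {
    obj.refs.fetch_add(1, std::memory_order_relaxed);
}

// Decrement: the caller must see the old value to test for the last
// reference, so the result has to come back to the core.
// Returns true when the caller dropped the last reference.
inline bool release(RefCounted& obj) {
    std::uint32_t old = obj.refs.fetch_sub(1, std::memory_order_acq_rel);
    assert(old > 0 && "release() called on an object with no references");
    return old == 1;
}
```

Nothing here says which placement is faster; that is exactly the point in dispute. It only shows that `retain` has no consumer for its result while `release` does.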