By: Maynard Handley (name99.delete@this.name99.org), August 18, 2018 1:42 pm
Room: Moderated Discussions
Doug S (foo.delete@this.bar.bar) on August 18, 2018 1:43 pm wrote:
> AM (myname4rwt.delete@this.jee-male.com) on August 18, 2018 1:06 pm wrote:
> > > > What they promised was 40% higher perf in the same power envelope as A75. And that for A76 on 7nm vs A75
> > > > on 10nm.
> > >
> > > Considering that the difference between 7nm and 10nm is
> > > less than 40% it does imply improved uArch efficiency.
> >
> > This is not true. E.g. TSMC has exactly 40% power reduction in its 7nm tech vs 10nm.
>
>
> TSMC promising 40% power reduction in 7nm versus 10nm does not imply anything about performance. They say "20%
> better performance, 40% better power" but those types of things have always been one or the other in the past.
> i.e. you can either get the same performance at 40% less power or 20% more performance at the same power.
>
> If ARM is promising 40% more performance, and get 20% from the process due to faster transistor switching speeds,
> they have to get the other 20% from greater IPC, assuming they aren't also increasing the power budget.
>
> Of course, it would be helpful to know what ARM uses to do its measurements. It is possible to achieve huge improvements
> in SIMD heavy code by allowing the issue/retirement of more SIMD instructions per cycle. If you use (or assume)
> faster or lower latency RAM, you can get big improvements in FP. Getting improvements in twisty branchy integer
> code is the most difficult since it isn't just one change but a lot of them to get there. But presumably that's
> what they are targeting since SIMD and FP are less important on phones than integer code.
>
> There's obviously ample room for ARM to make such improvements. Looking at the Galaxy S9 version using the
> SD845 w/A75 @ 2.8 GHz versus the iPhone X using the A11 @ 2.39 GHz, despite the 17% clock rate handicap the
> iPhone is ~70% faster on GB4 integer, and that holds even if you focus in on the more twisty branchy subtests
> like LLVM or HTML5 DOM. So basically double the IPC. What's surprising is not that ARM is claiming a 20% improvement,
> and further improvements in the future. What's surprising is that they fell that far behind.
>
> So I don't think there's any reason to be skeptical about ARM getting a 20% improvement here, though
> that still leaves them at minimum 50% behind Apple - since Apple will get the same 20% improvement from
> TSMC, and presumably makes further IPC improvements in the A12. Very significant improvements, if you
> believe Charlie's claims about the A12 of 50% greater single thread performance... I'm kinda skeptical
> of that myself, but I guess we'll see next month. Almost makes me wonder if the numbers he saw were benchmarks
> of an A12X going into an iPad Pro, or possibly even a prototype of something targeted at laptops. 50%
> higher when they are only getting 20% from process would mean another huge IPC gain.
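For what it's worth, the arithmetic in the quoted post can be sanity-checked with a few lines. This is only a back-of-the-envelope sketch: the variable names are mine, and it assumes that performance is roughly clock times IPC and that process and IPC gains compound multiplicatively (which is why the "other 20%" comes out nearer 17%).

```python
# Figures taken from the quoted post; everything else is an assumption.
a75_clock_ghz = 2.8    # SD845 (A75) in the Galaxy S9
a11_clock_ghz = 2.39   # A11 in the iPhone X
a11_speedup = 1.70     # "~70% faster on GB4 integer"

# Assumption: performance ~ clock * IPC, so the IPC ratio is the
# performance ratio scaled by the A75's clock advantage.
clock_ratio = a75_clock_ghz / a11_clock_ghz
ipc_ratio = a11_speedup * clock_ratio

# Assumption: gains compound multiplicatively.
process_gain = 1.20                      # "20% better performance" from 7nm
a76_ipc_gain = 1.40 / process_gain       # ARM's 40% claim
a12_ipc_gain = 1.50 / process_gain       # the rumoured 50% for A12

print(f"A11 vs A75 IPC ratio: {ipc_ratio:.2f}x")
print(f"A76 IPC gain needed:  {a76_ipc_gain:.3f}x")
print(f"A12 IPC gain needed:  {a12_ipc_gain:.3f}x")
```

So "basically double the IPC" holds up, and a 50% A12 gain on top of a 20% process gain would need roughly 25% more IPC.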
Is the 50% outrageous? Well, who knows. But if you like to go by pattern matching (insofar as that describes what different teams are doing and their schedules), we had something like
- A6 was "safe"
- A7 was a great leap forward
- A8 was safe tweaks + improvements
- A9 was a great leap forward
- A10, A11 were safe tweaks and improvements
meaning that if there are multiple teams, one of which is the "superstar team" tasked with great leaps forward, it might be time for their next product?
As for whether it's possible, I'd argue it absolutely is. One problem, of course, is that we don't know what Apple has ALREADY implemented to get them to 50% higher IPC than Intel, but the copious pool of KIP enhancements seems so far to be only lightly tapped (and most of these ideas are orthogonal and can be implemented one at a time, not all at once).
So there's scope for a lot less dead time waiting on data.
For waiting on instructions, we've known for years that you can reduce I-cache waits to basically nothing (except for particular types of server code) through things like
- train the prefetchers ONLY on RETIRED "app" instructions. Don't allow interrupts, or mispredicted (wrong-path) instructions, to pollute your prefetchers (and ideally do something like mark the lines they bring in as LRU rather than MRU)
- code in L2 is more valuable than data in L2, so give it a better chance of surviving there, through various means (this can be through placement/replacement priorities, or explicitly flagging it as I-lines)
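As a toy illustration of the first point, here is a minimal sketch of a prefetcher table that is only trained at retirement. Everything in it is hypothetical (the class and field names, the 64-byte line size, the simple "which line followed which" table); it is not a description of any shipping design, just the filtering idea.

```python
from dataclasses import dataclass

LINE_BYTES = 64  # assumed I-cache line size


@dataclass
class FetchEvent:
    pc: int
    retired: bool        # False for wrong-path (squashed) instructions
    in_interrupt: bool   # True while an interrupt handler is running


class RetireTrainedPrefetcher:
    """Remembers which I-line followed which, using retired app code only."""

    def __init__(self):
        self.next_line = {}    # line number -> line number seen after it
        self.last_line = None

    def observe(self, ev):
        # Wrong-path and interrupt instructions never train the table.
        if not ev.retired or ev.in_interrupt:
            return
        line = ev.pc // LINE_BYTES
        if self.last_line is not None and line != self.last_line:
            self.next_line[self.last_line] = line
        self.last_line = line

    def predict(self, pc):
        """Line to prefetch after the line holding pc, or None if unknown."""
        return self.next_line.get(pc // LINE_BYTES)


pf = RetireTrainedPrefetcher()
stream = [
    FetchEvent(pc=0x1000, retired=True, in_interrupt=False),
    FetchEvent(pc=0x9000, retired=False, in_interrupt=False),  # wrong path
    FetchEvent(pc=0x5000, retired=True, in_interrupt=True),    # interrupt
    FetchEvent(pc=0x2000, retired=True, in_interrupt=False),
]
for ev in stream:
    pf.observe(ev)
```

After this stream the table links the 0x1000 line directly to the 0x2000 line, and neither the wrong-path nor the interrupt address appears in it.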
Finally for actual execution, there's scope for a nice win (for many purposes) by aggressive use of
- pairs of ALU instructions that feed into the same variable (something like R3 = R4 + (R5 AND R6))
- compiled to use the same intermediate register and placed back to back, so
    R3 = R5 & R6
    R3 = R3 + R4
- fused in decode
- sent through a double-pumped ALU that can do the expected "easy" ops (+, -, logicals, MAYBE small shifts?) in a half cycle à la P4
This doesn't cover everything, obviously, but it gets a fraction of your code resolving dependencies at twice the nominal frequency, without much real cost.
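To put a rough number on that, here is a small sketch of the latency of a serial dependency chain with and without fusion. The model is entirely my assumption: each "easy" op costs one cycle on its own, a fused back-to-back pair of easy ops costs one cycle in total (half a cycle each), anything else keeps its normal latency, and pairing is greedy from the start of the chain.

```python
# Assumed latencies in cycles; "mul" stands in for any non-"easy" op.
LATENCY = {"add": 1, "sub": 1, "and": 1, "or": 1, "xor": 1, "mul": 3}
SIMPLE_OPS = {"add", "sub", "and", "or", "xor"}


def chain_latency(ops, fuse=False):
    """Cycles to resolve a chain where each op depends on the previous one."""
    cycles = 0
    i = 0
    while i < len(ops):
        pair_ok = (
            fuse
            and i + 1 < len(ops)
            and ops[i] in SIMPLE_OPS
            and ops[i + 1] in SIMPLE_OPS
        )
        if pair_ok:
            cycles += 1   # both halves finish inside one nominal cycle
            i += 2
        else:
            cycles += LATENCY[ops[i]]
            i += 1
    return cycles


chain = ["and", "add", "mul", "add", "sub"]
print(chain_latency(chain), chain_latency(chain, fuse=True))
```

For this chain the unfused latency is 7 cycles and the fused latency is 5; the multiply in the middle is untouched, which is the "doesn't cover everything" part.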
Of course, for optimal value this means you need to expand decode proper, maybe from 6 to 8 instructions per cycle, and you need an extremely aggressive fetch system; but both of these are feasible with every instruction 4 bytes wide and no need to care about ARMv7 and Thumb any more.
There's also a whole bunch of interesting work on criticality analysis that basically allows you to get all the benefit of OoO and speculation where it's valuable, while not paying that cost (in power and in misspeculation + recovery) for the (surprisingly common) case of non-critical instructions. This looks to me like very useful work that's not been appreciated enough in academia, and that has not (yet...?) had commercial implementation.
There's also a bunch of things that (as far as we know) Apple doesn't yet do but that are clearly feasible soon (if not this year then 2019 or 2020). These include things
- that improve matters at the system level: hardware-compressed pages, or remote atomics (to speed up common atomic operations like ARC increments);
- that make the L3 cache effectively larger (compressed lines in L3 and/or in [non-compressed-page] RAM);
- or that give an effectively very large LLC (maybe an L3, maybe a "before RAM" "L4") implemented not in eDRAM the way these things used to be, but in STT-MRAM.
To give just one example: the A10 and A10X are clocked the same, but the A10X scores about 17% higher in single-threaded GB4. This comes from just a larger L2 (8 MiB vs 3 MiB, and that's not as drastic as it seems because the A10 has an additional 4 MiB L3), and from the A10X having what appears to be the same single-channel memory controller, but with a 128-bit-wide rather than 64-bit-wide interface to RAM. I.e., just these two uncore tweaks get you an awful lot.
Now the A11 has picked up some of this (L2 appears to be 8 MiB shared by all 2+4 cores), but the larger point is that there is a lot of value to be mined from uncore improvements, and Apple has been very willing, from each generation to the next, to tinker with the details of this to hit the optimal point.