By: Maynard Handley (name99.delete@this.name99.org), October 26, 2016 10:07 am
Room: Moderated Discussions
Heikki Kultala (heikki.kultala.delete@this.tut.fi) on October 25, 2016 11:47 pm wrote:
> juanrga (noemail.delete@this.juanrga.com) on October 25, 2016 9:57 am wrote:
> > anon (spam.delete.delete@this.this.spam.com) on October 23, 2016 7:25 am wrote:
> > > juanrga (noemail.delete@this.juanrga.com) on October 23, 2016 6:09 am wrote:
> > > > anon (spam.delete@this.spam.com) on October 22, 2016 8:52 am wrote:
> > > >
> > > > > I mean
> > > >
> > > > > > Apple doesn’t always have the best performance per square millimeter,
> > > > > > writes Gwennap, but it makes up for it in efficiency per clock cycle
> > > >
> > > > > that's not how it works.
> > > >
> > > > His first claim is correct, Apple Hurricane doesn't have the best performance per area,
> > > > but this is expected because it is a latency-optimized core not a throughput optimized-core.
> > > > About his second claim if by "efficiency per clock cycle" he means IPC/Area then his claim
> > > > is wrong or right depending if he is comparing to Intel or to other ARM cores.
> > >
> > > My point is that perf = clockrate * ipc. Whether the ipc is high with low clockrates
> > > or abysmal with insane clockrates doesn't matter at all for perf/area. Same
> > > perf and same area mean same perf/area, regardless of the ipc.
> >
> > But he talks about "efficiency per clock cycle" which suggest he is talking about
> > IPC/Area, not about Perf/Area. And the superior IPC/Area of Apple chips compared
> > to Intel chips is related to ARM64 efficiency: the well-known "x86 tax".
>
> No, it's mostly is because:
>
> 1) Intel chips use longer pipelines to achieve higher clock speeds
> and the longer pipelines costs transistors and chip area.
>
> 2) Intel has much beefier SIMD side which does nothing on integer benchmarks. Run FP SIMD codes and intel has
> much better performance/clock. Intel also does have a very beefy division unit which is used quite rarely.
Beefier SIMD, yes. MUCH beefier is more debatable...
- Apple apparently has three 128b pipelines, Intel has two 256b pipelines. So 384 vs 512 bits of SIMD width per cycle.
- A single Intel instruction can do 2x as much work, but that instruction is longer, and Apple can decode/dispatch more instructions per cycle.
- Intel has gather. ARM doesn't have that (yet...) but has 2-, 3-, and 4-wide "rearrangers" that catch many common cases (though not generic sparse linear algebra). LLVM is starting to use them, though this has taken quite some time. And ARM has the write side of this, which mainstream AVX does not.
You can see this when you compare roughly similar CPUs.
Compare A10 to i5-5200U @ 2.70 GHz, which is the closest reasonable matchup I can find.
The overall single-threaded scores are 3510 vs 3519 (A10 vs i5), and the two subscores of interest (very obviously vector friendly) are
- SGEMM 48 vs 63 GFlops
- SFFT 7 vs 9.3 GFlops
Some of the other benchmarks are likely also vectorizable (HDR? Gaussian Blur? Speech Recognition?) but we can be less certain of those.
Perhaps Intel handles doubles better (both these Geekbench tests are single), but I've no reason to believe that. (And does AVX handle 16-bit floats?)
What's certainly the case for now is that Intel has a Fortran compiling story which is, I assume, of very little interest to Apple, but which allows some SPEC codes to be compiled. Intel also has a compiler that's *probably* better than Apple's at extracting vectorization from "generic" scientific code (and certainly at doing so for a very specific set of scientific code...). But both of those are irrelevant to the area/underlying hardware claim.
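To put the numbers above side by side, here is a rough sketch of the arithmetic. It uses only the figures quoted in this post (pipeline counts and widths, and the Geekbench scores); the variable names are mine, and it deliberately ignores clock speed, so read it as a sanity check rather than a per-clock comparison.

```python
# All inputs are the figures quoted in the post; nothing is measured here.
# Clock frequency is deliberately ignored, so these are raw ratios only.

apple_bits = 3 * 128  # Apple: apparently three 128b SIMD pipelines
intel_bits = 2 * 256  # Intel: two 256b SIMD pipelines


def ratio(intel, apple):
    """Return how many times larger the Intel figure is than the Apple one."""
    return intel / apple


width_ratio = ratio(intel_bits, apple_bits)  # peak SIMD bits per cycle
sgemm_ratio = ratio(63, 48)                  # SGEMM GFlops, i5 vs A10
sfft_ratio = ratio(9.3, 7)                   # SFFT GFlops, i5 vs A10
overall_ratio = ratio(3519, 3510)            # overall single-threaded score

for name, value in [
    ("SIMD width", width_ratio),
    ("SGEMM", sgemm_ratio),
    ("SFFT", sfft_ratio),
    ("Overall ST", overall_ratio),
]:
    print(f"{name}: Intel/Apple = {value:.2f}x")
```

On these figures the vector-friendly subscores favor Intel by roughly the same factor as the raw SIMD width does (about 1.3x), while the overall scores are essentially tied. That is beefier, but hardly MUCH beefier.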