By: Simon Farnsworth (simon.delete@this.farnz.org.uk), October 28, 2016 6:19 am
Room: Moderated Discussions
juanrga (noemail.delete@this.juanrga.com) on October 28, 2016 2:02 am wrote:
> Simon Farnsworth (simon.delete@this.farnz.org.uk) on October 25, 2016 11:03 am wrote:
> > juanrga (noemail.delete@this.juanrga.com) on October 25, 2016 9:57 am wrote:
> > > anon (spam.delete.delete@this.this.spam.com) on October 23, 2016 7:25 am wrote:
> > > > juanrga (noemail.delete@this.juanrga.com) on October 23, 2016 6:09 am wrote:
> > > > > anon (spam.delete@this.spam.com) on October 22, 2016 8:52 am wrote:
> > > > >
> > > > > > I mean
> > > > >
> > > > > > > Apple doesn’t always have the best performance per square millimeter,
> > > > > > > writes Gwennap, but it makes up for it in efficiency per clock cycle
> > > > >
> > > > > > that's not how it works.
> > > > >
> > > > > His first claim is correct, Apple Hurricane doesn't have the best performance per area,
> > > > > but this is expected because it is a latency-optimized core not a throughput optimized-core.
> > > > > About his second claim if by "efficiency per clock cycle" he means IPC/Area then his claim
> > > > > is wrong or right depending if he is comparing to Intel or to other ARM cores.
> > > >
> > > > My point is that perf = clockrate * ipc. Whether the ipc is high with low clockrates
> > > > or abysmal with insane clockrates doesn't matter at all for perf/area. Same
> > > > perf and same area mean same perf/area, regardless of the ipc.
> > >
> > > But he talks about "efficiency per clock cycle" which suggest he is talking about
> > > IPC/Area, not about Perf/Area. And the superior IPC/Area of Apple chips compared
> > > to Intel chips is related to ARM64 efficiency: the well-known "x86 tax".
> > >
> > > > IPC/area is nice and all but it doesn't buy you anything. I can get you tremendous IPC
> > > > by running the core so slow that I get a RAM to register load to use latency of 1 cycle.
> > >
> > > The variation of IPC with clocks is very small and you can only get huge IPC gains by
> > > setting extremely low clocks, but that is not happening here. Hurricane is clocked at
> > > 2.34GHz. Underclocking a 4GHz Haswell chip to 2GHz increases the IPC by less than 5%.
> > > Apple achieving IPC parity with best Intel designs is not due to lower clocks...
> >
> > That claim does not fit my understanding of how IPC gets
> > exploited in real world chips. Downclocking Haswell
> > won't increase IPC by much, because the design is for high clock rates, and thus the increased IPC from a
> > lower clock is only available because the ratio between memory speed and processor speed is reduced.
> >
> > However, if you're designing to a target clock speed, you can get much higher IPC on a comparable
> > process if your clock speed is lower than if it's higher; this is simply because if the processes
> > are comparable, the FO4 time is comparable, but at 2 GHz, you can fit twice as many FO4 time units
> > (thus twice as many transistors) in the critical path compared to a 4 GHz clock.
> >
> > Thus, for your claim to be true, either the process Apple is using is far behind Intel, such that
> > the FO4 time is about twice that of the Intel process (so Apple get the same number of transistors
> > in the critical path as Intel, but at half the clock speed), or Apple is leaving performance
> > on the table, by designing for a target clock of 4 GHz, then only achieving 2 GHz, when they could
> > achieve higher IPC and higher performance by designing around the 2 GHz target clock.
> >
> > Assuming that Apple aren't being idiots, and that TSMC/GloFo/Samsung
> > processes are comparable to Intel's processes
> > (within 20%, say), the most likely explanation is that they're
> > getting their IPC by exploiting the longer clock
> > cycles to run more logic per clock cycle. This, in turn,
> > means that the chip is unlikely to scale to the same
> > high clock speeds as an Intel chip does, because they run out of FO4 delay as the clock goes up.
> >
> > Equally, of course, this implies that an Intel core run at mobile speeds is leaving performance on
> > the table - you've designed around the constraints of high speed operation, then decided to clock lower,
> > when you could have designed for the lower clock, and had more logic running per clock cycle.
>
> Essentially the same rule applies upwards and downwards, only parameters vary.
>
> If your design is optimized for 4GHz and underclocking to 2GHz increases the IPC by less than 5%, then
> if your design is optimized for 2GHz, overclocking it to 4GHz will reduce the IPC by a similar amount.
The same rule does not apply upward and downward. If it did, I could reliably overclock a Pentium III to 4 GHz, just as I can downclock a modern Intel chip to 800 MHz.
If you're underclocking, then you're increasing the number of FO4 delays per clock cycle; a design that's reliable running at 3 GHz with 15 FO4 delays per clock cycle (thus the critical path stabilises in under 15 FO4 delays) is also reliable at 2 GHz with 22.5 FO4 delays per clock cycle - it's just in the stable state for an extra 7.5 FO4 delays.
If you're overclocking, however, you can make the chip unstable; if the design is reliable with 22.5 FO4 delays per clock cycle at 2 GHz, and you've designed a critical path that's 20 FO4 delays long (because 2 GHz was your target frequency), you cannot run stably much beyond 2.25 GHz (2 GHz × 22.5 / 20), as beyond that, you have a critical path that does not stabilise before the next clock cycle.
Thus, there are two routes to showing that Apple's designs are equivalent to Intel's:
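To make the arithmetic in the two paragraphs above concrete, here is a rough sketch in Python. The function names and the FO4 delay value are my own illustration: the FO4 time is simply back-calculated from the "3 GHz with 15 FO4 delays per cycle" example, not a measured figure for any real process.

```python
# Sketch only: FO4_PS is derived from the illustrative "3 GHz, 15 FO4 per cycle"
# example in the post, not from any real process data.

def fo4_per_cycle(freq_ghz, fo4_ps):
    """Number of FO4 delays that fit into one clock cycle at freq_ghz."""
    cycle_ps = 1000.0 / freq_ghz  # 1 GHz corresponds to a 1000 ps cycle
    return cycle_ps / fo4_ps

def max_stable_freq_ghz(critical_path_fo4, fo4_ps):
    """Highest clock at which the critical path still settles within one cycle."""
    return 1000.0 / (critical_path_fo4 * fo4_ps)

# FO4 delay implied by 15 FO4 delays per cycle at 3 GHz (about 22.2 ps).
FO4_PS = 1000.0 / (3.0 * 15.0)

# Underclocking 3 GHz -> 2 GHz: the cycle grows to about 22.5 FO4 delays,
# so a path that settled in 15 still settles, with slack to spare.
budget_at_2ghz = fo4_per_cycle(2.0, FO4_PS)

# Overclocking a design whose critical path is 20 FO4 delays long:
# the limit is about 2.25 GHz, nowhere near 4 GHz.
limit_ghz = max_stable_freq_ghz(20.0, FO4_PS)

print(f"FO4 delays per cycle at 2 GHz: {budget_at_2ghz:.2f}")
print(f"Max stable clock for a 20 FO4 critical path: {limit_ghz:.2f} GHz")
```

The asymmetry is visible in the two functions: lowering the clock only ever increases the per-cycle budget, while raising it runs into a hard ceiling set by the critical path length.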
1. Show that Intel's process has a shorter FO4 delay, and that the performance differential goes away once you account for Intel's process advantage (e.g. "at 5 GHz, Skylake's critical path on an Intel process is 20 FO4 delays; at 2.5 GHz, Hurricane's critical path on a TSMC process is 20 FO4 delays. Hurricane is the same perf/FO4 delay as Skylake, although half the perf of Skylake, thus Apple has a great design held back by poor process tech.")
2. Show that the fastest Apple chip gets the same performance at rated clock speed as the fastest Intel chip gets at rated clock speed. No games here with perf/Hz, or perf/watt, or choosing a mobile Intel chip instead of the fastest they sell - just show that if I buy the best Intel has to offer me, and the best Apple has to offer me, I get comparable performance.
The second of these is uncontroversial but hard - it probably needs Apple to co-operate, because of the TDP difference between Intel's fastest parts at over 100 W TDP and Apple's standard TDPs. The first is more challenging still - it needs careful study of the silicon from both fabs to establish the critical paths, and their length in FO4 delays or an equivalent metric.
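As a sketch of what route 1 would have to demonstrate, the following Python works through the made-up numbers from the example in point 1 (5 GHz Skylake and 2.5 GHz Hurricane, both with a 20 FO4 critical path). These are hypothetical figures for illustration, not measurements, and the helper assumes the critical path exactly fills the clock cycle.

```python
# Sketch only: the clock speeds and critical path lengths are the hypothetical
# values from the example in point 1, not measured data for either chip.

def fo4_delay_ps(freq_ghz, critical_path_fo4):
    """FO4 delay implied if the critical path exactly fills one clock cycle."""
    cycle_ps = 1000.0 / freq_ghz  # 1 GHz corresponds to a 1000 ps cycle
    return cycle_ps / critical_path_fo4

skylake_fo4_ps = fo4_delay_ps(5.0, 20)    # implied FO4 delay on the Intel process
hurricane_fo4_ps = fo4_delay_ps(2.5, 20)  # implied FO4 delay on the TSMC process

# If both critical paths really are 20 FO4 delays long, the whole clock speed
# gap has to come from the process: one FO4 delay takes twice as long.
process_ratio = hurricane_fo4_ps / skylake_fo4_ps

print(f"Implied FO4 delay, Skylake:   {skylake_fo4_ps:.1f} ps")
print(f"Implied FO4 delay, Hurricane: {hurricane_fo4_ps:.1f} ps")
print(f"Process FO4 ratio: {process_ratio:.1f}x")
```

In other words, for route 1 to hold with these numbers, you would need evidence that the TSMC process has roughly double the FO4 delay of Intel's, which is a far bigger gap than the "within 20%" I assumed earlier.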