By: Wilco (Wilco.Dijkstra.delete@this.ntlworld.com), November 8, 2016 3:57 am
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on November 7, 2016 7:43 pm wrote:
> Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on November 7, 2016 2:45 am wrote:
> > juanrga (noemail.delete@this.juanrga.com) on November 6, 2016 4:06 pm wrote:
> > > Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on November 6, 2016 4:15 am wrote:
> > > > juanrga (noemail.delete@this.juanrga.com) on November 4, 2016 4:39 pm wrote:
> > > > > Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on November 4, 2016 1:12 pm wrote:
> > > > > > anon (spam.delete.delete@this.this.spam.com) on November 4, 2016 6:12 am wrote:
> > > > > > > Yes, but I still don't see any reason why that means the A10 is 6 wide.
> > > > > > >
> > > > > > > If Sandy/Ivy Bridge is 6 wide then that doesn't mean Haswell/Broadwell is 6 wide.
> > > > > >
> > > > > > Neither are 6-wide using the standard definition.
> > > > >
> > > > > If you bother to read one of the resources given (the first link
> > > > > I gave), you will learn that there is no "standard" definition:
> > > >
> > > > Actually that same link does give a standard definition:
> > > >
> > > >
> > > > The number of instructions able to be issued, executed or completed per cycle is called a processor's
> > > > width. Note that the issue width is less than the number of functional units – this is typical.
> > >
> > > Now continue reading the reference until you get to the point where he discusses if Haswell
> > > would be considered 4-wide, 5-wide, or 8-wide, why it depends on what definition of "wide"
> > > you use,
> >
> > No the definition by itself is clear. Instructions are instructions, there
> > is no room for argument. No Intel core can decode/execute 8 instructions
> > per cycle. Apple A7 can do 6, Haswell can do 4, it's as simple as that.
>
> I believe that Intel cores can actually do 5 instructions/clock with the macro-op fusion.
Yes, if you have one branch every 5 instructions. Obviously other CPUs support fusion too.
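To make the arithmetic concrete, here is a rough sketch of that bound. It assumes a deliberately simple model: the renamer handles a fixed number of slots per cycle, and exactly one compare+branch pair in every group of `branch_every` instructions fuses into a single slot. The function name and parameters are hypothetical, and the model ignores every other limit (decode, per-cycle fusion limits, execution ports).

```python
from fractions import Fraction

def fused_ipc(rename_width: int, branch_every: int) -> Fraction:
    """Upper bound on instructions per cycle when one compare+branch pair
    in every `branch_every` instructions fuses into a single rename slot.
    Illustrative model only; not a description of any real pipeline."""
    if branch_every < 2:
        raise ValueError("a fused pair needs at least 2 instructions")
    # The fused pair occupies one slot instead of two.
    slots_per_group = branch_every - 1
    # Cycles per group = slots_per_group / rename_width, so
    # instructions per cycle = branch_every * rename_width / slots_per_group.
    return Fraction(branch_every * rename_width, slots_per_group)

print(fused_ipc(4, 5))          # one fusible branch every 5 instructions
print(float(fused_ipc(4, 10)))  # one fusible branch every 10 instructions
```

With a 4-slot renamer, one fusible branch every 5 instructions gives exactly 5 instructions per cycle, and sparser branches bring the bound back toward 4.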
> > > and why he choses 8-wide (as myself did, as the other references given also did).
> >
> > The reason some people choose 8-wide is to pretend Intel's cores are wider than other
> > CPUs. When you look internally both Apple A7 and Cortex-A57 are wider than Haswell.
>
> Even width itself is only relevant as an approximation for IPC. Having an 8-wide CPU with
> one load/store unit is no good for general purpose code (would work for some DSP, maybe).
>
> Also, not all instructions are equal. In particular, on vector codes, AVX could give x86
> an advantage. Similarly, on spill heavy code, ARMv7 may have an advantage using LDM/STM.
Absolutely. However, the effect of the ISA on IPC is fairly small, given that one mostly executes simple operations (for example, load-op is rarely used on x86). But the discussion is about width.
> > In fact Apple A7 is wider than Haswell in every regard. So anyone claiming Haswell is 8 wide
> > but A7 is only 6 wide is simply lying because by the same measure A7 is actually 9 wide.
>
> It certainly appears that the A7 has similar IPC at low frequencies.
Indeed, and it looks like the A10 has even better IPC despite almost doubling the frequency.
> > > > So it's the number of instructions that one can process, not micro-ops
> > > > (as those vary significantly with the microarchitecture).
> > > >
> > >
> > > The issue is that what you call "instructions" is not what is reordered, issued,
> > > executed, tracked, and retired in the metal of a modern chip as Haswell, or
> > > Cyclone, or A72, or Hurricane, or Zen, or Vulcan, or Kaby Lake, or...
> >
> > There is an almost 1:1 correspondence between instructions
> > and micro-ops, so yes instructions are what matters.
>
> The median instruction decodes to 1 uop. But when you run into nasty ones, they
> are worth noting. I would always carefully consider both uops and instructions.
Sure, but most commonly used instructions are a single micro-op in the unfused domain. Stores need 2 uops after rename since Haswell.
> In fact with Intel's uop cache, it's not even clear that you measure front-end width in
> instructions anymore...it could operate solely in the uop domain for extended periods.
That's quite possible. However, the renamer still has a limit of 4 macro-ops, and dispatch of unfused ops is limited to 5 in Haswell and 6 in Skylake. So without macro-op fusion there is no way you can ever execute more than 4 instructions per cycle.
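As a back-of-the-envelope check, the two limits can be combined into a single bound. This is only a sketch: the widths (4 at rename, 5 or 6 unfused uops at dispatch) are the figures quoted in this post, while the function name and the `store_fraction` parameter are hypothetical. It assumes no macro-op fusion, that every non-store instruction is one unfused uop, and that every store is two.

```python
def ipc_bound(rename_width: int, dispatch_width: int, store_fraction: float) -> float:
    """Instructions-per-cycle ceiling from two limits, assuming no fusion:
    the renamer (one slot per instruction) and unfused-uop dispatch.
    Illustrative model only; the widths are inputs, not measured facts."""
    if not 0.0 <= store_fraction <= 1.0:
        raise ValueError("store_fraction must be between 0 and 1")
    # Each store contributes one extra unfused uop after rename.
    uops_per_instruction = 1.0 + store_fraction
    dispatch_limit = dispatch_width / uops_per_instruction
    return min(rename_width, dispatch_limit)

# Widths quoted above: 4 at rename, 5 (Haswell) or 6 (Skylake) at dispatch.
for name, dispatch in (("Haswell", 5), ("Skylake", 6)):
    for stores in (0.0, 0.25, 0.5):
        print(name, stores, round(ipc_bound(4, dispatch, stores), 2))
```

Under these assumptions the bound never exceeds the rename width of 4, and with a dispatch width of 5 it drops below 4 once more than a quarter of the instructions are stores.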
Wilco