By: anon (anon.delete@this.anon.com), October 5, 2015 9:26 pm
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on October 5, 2015 12:24 pm wrote:
> juanrga (nospam.delete@this.juanrga.com) on October 5, 2015 10:46 am wrote:
> > Wouter Tinus (wouter.tinus.delete@this.gmail.com) on October 4, 2015 4:18 pm wrote:
> > > juanrga (nospam.delete@this.juanrga.com) on October 4, 2015 6:30 am wrote:
> > > > Are you kidding?
> > >
> > > No I'm not. Either your reasoning is inconsistent or you're missing some of the facts.
> > > You started by claiming: "Skylake is 8-wide (unfused uops) like Haswell," yet the former
> > > can in fact retire twice as many µops per cycle as the latter [1]. So regardless of whether
> > > you want to count unfused or fused µops, Skylake is wider than Haswell.
> > >
> > > If you want to multiply both numbers by two to get unfused µops [2], then you should
> > > rightly call Skylake a 16-wide core (2 threads x 4 µops x 2). If you don't want to
> > > do that, then perhaps you should reconsider your definition of wideness :)
> > >
> > > [1] http://www.anandtech.com/show/9582/intel-skylake-mobile-desktop-launch-architecture-analysis/5
> > > [2] I'm not sure if that is fair or accurate, but not judging here
> > >
> >
> > You said that Skylake is
> >
> > > > > - 5 wide decode
> > > > > - 6 wide allocation/decoder queue
> > > > > - 6 wide ROB
> > > > > - 8 wide issue
> > > > > - 8 wide retire (4/thread)
> >
> > Using retire as metric, Skylake is then 8-wide. Haswell/Broadwell can
> > also retire up to 8 uops per cycle. Thus both are 8-wide as well.
> >
> > There is no way that Skylake can issue and retire 16 ops per cycle, and Anandtech don't say the contrary.
>
> Generally, I tend to refer to microarchitectures as X-wide, where X is determined
> by the narrowest stage. Poulson cannot sustain 12 instructions per cycle, and is
> generally a 6-wide machine that just happens to have 12 execution units.
>
> With Skylake, it's a little complex - I can see an argument for 6-wide (assuming
> hit in the uop cache) or 5-wide. Both of those are sustainable, although in the
> case of I$ hits, I'm not sure there is really enough fetch bandwidth.
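To put numbers on the "narrowest stage" rule, here is a rough sketch that takes the stage widths quoted further up at face value. The dictionary, the function name and the `bypassed` parameter are mine, purely for illustration, and treating a uop cache hit as simply skipping the decode stage is a simplifying assumption.

```python
# Stage widths as quoted earlier in the thread (uops per cycle).
# These are the thread's figures, not independently verified.
STAGE_WIDTHS = {
    "decode": 5,
    "allocation/decoder queue": 6,
    "ROB": 6,
    "issue": 8,
    "retire": 8,
}

def sustained_width(widths, bypassed=()):
    """Width of the narrowest stage, ignoring any stage named in `bypassed`."""
    return min(w for stage, w in widths.items() if stage not in bypassed)

# Legacy decode path: the decoders are the bottleneck.
print(sustained_width(STAGE_WIDTHS))                        # 5
# Assumed uop cache hit: decode is skipped, allocation/ROB become the limit.
print(sustained_width(STAGE_WIDTHS, bypassed=("decode",)))  # 6
```

That reproduces the "5-wide or 6-wide" reading, and shows why the 8-wide issue and retire figures do not set the sustainable rate under this rule.
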
I don't think it has been established whether Skylake actually has 5 decoders. The optimization manual seems to say only that the decoders can emit up to 5 uops per cycle.
According to Agner, the Haswell decoders handle these patterns (uops per instruction, one instruction per decoder):
1-1-1-1
2-1-1
3
4
Skylake may be
1-1-1-1
2-1-1-1
3
4
And/or 3-1-1. There are enough instructions that decode to 2 or 3 fused uops that this might amount to a significant increase in effective decode width.
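
A toy model of what that would mean, for anyone who wants to play with it. The Haswell patterns are the ones listed above; the Skylake set is only my guess from above (both 2-1-1-1 and 3-1-1 included), not anything documented. The function and names are made up for illustration, and the model ignores fetch limits, instruction length, macro-fusion and the uop cache.

```python
# Each pattern gives the fused uops emitted by each decoder in one cycle.
HASWELL = [(1, 1, 1, 1), (2, 1, 1), (3,), (4,)]
# Speculative Skylake patterns, an assumption and not a documented fact.
SKYLAKE_GUESS = [(1, 1, 1, 1), (2, 1, 1, 1), (3, 1, 1), (4,)]

def decode_cycles(stream, patterns):
    """Count cycles to decode `stream` (uops per instruction, in order).

    Each cycle greedily takes the longest run of upcoming instructions
    that matches the start of some allowed pattern (unused decoders idle).
    """
    i = 0
    cycles = 0
    while i < len(stream):
        best = 0
        for p in patterns:
            n = 0
            while n < len(p) and i + n < len(stream) and stream[i + n] == p[n]:
                n += 1
            best = max(best, n)
        if best == 0:
            # e.g. a microcoded instruction, which this sketch does not model
            raise ValueError(f"no pattern can start with {stream[i]} uops")
        i += best
        cycles += 1
    return cycles

# Eight instructions, ten fused uops: a 2-uop instruction followed by
# three 1-uop instructions, repeated twice.
stream = [2, 1, 1, 1] * 2
print(decode_cycles(stream, HASWELL))        # 4
print(decode_cycles(stream, SKYLAKE_GUESS))  # 2
```

On that stream the guessed patterns halve the decode cycles (2.5 versus 5 fused uops per cycle), while a stream of pure 1-uop instructions sees no difference at all. So the gain would depend entirely on how often 2 and 3 uop instructions show up.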