By: anon (anon.delete@this.anon.com), October 7, 2015 6:21 pm
Room: Moderated Discussions
juanrga (nospam.delete@this.juanrga.com) on October 7, 2015 11:46 am wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on October 7, 2015 5:49 am wrote:
> > David Kanter (dkanter.delete@this.realworldtech.com) on October 5, 2015 12:24 pm wrote:
> > >
> > > With Skylake, it's a little complex - I can see an argument for 6-wide (assuming
> > > hit in the uop cache) or 5-wide. Both of those are sustainable, although in the
> > > case of I$ hits, I'm not sure there is really enough fetch bandwidth.
> >
> > I'd be very surprised if Skylake is "5-wide decode" in the sense
> > that most people think of it and it seems to be talked about here.
> >
> > In fact, Intel doesn't even say it's 5-wide. Intel says it's "5 uops max". It's possible
> > that that never means "five instructions" - it might be that you only get five uops
> > when one or more of the decoders end up decoding an instruction into multiple uops.
> >
>
> Haswell decoder can decode up to 16 bytes per cycle of complex instructions into up to 4 fused-uops,
> which are supplied down the pipeline. Since fused-uops are split into uops (Haswell can execute
> up to 8 uops per cycle), the decoder peak is higher than 4 when we are counting uops.
You don't measure decoder width in uops, though. You wouldn't say Haswell has a 4-wide decoder for complex instructions, for example: it has a single decoder for those, which can emit up to 4 uops per cycle. On the other hand, you could say Haswell has a 5-wide decoder because of instruction fusion, even though those peak cases still result in at most 4 uops per cycle being emitted. So decoder width isn't tied to uops one way or the other.
Of course that's all academic, but it's still important to try to keep to existing terminology as much as possible.
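To make the distinction concrete, here is a toy sketch of one decode cycle that counts instructions and uops separately. It is only an illustration of the terminology, not a description of the real hardware: the function name, the `fused_pairs` parameter and the 4-uop cap are assumptions taken from the numbers discussed in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decoded:
    instructions: int  # x86 instructions consumed this cycle
    uops: int          # fused uops emitted this cycle


def decode_cycle(uops_per_slot, fused_pairs=0, uop_cap=4):
    """Toy model of a single decode cycle (hypothetical, for illustration).

    uops_per_slot: uops emitted by each decoder slot used this cycle.
    fused_pairs:   number of slots holding a fused instruction pair,
                   i.e. two instructions that come out as one uop.
    uop_cap:       assumed limit on uops emitted per cycle.
    """
    if fused_pairs > len(uops_per_slot):
        raise ValueError("more fused pairs than decoder slots in use")
    uops = sum(uops_per_slot)
    if uops > uop_cap:
        raise ValueError("cycle exceeds the assumed uop cap")
    # Each slot consumes one instruction, plus one extra per fused pair.
    instructions = len(uops_per_slot) + fused_pairs
    return Decoded(instructions=instructions, uops=uops)


# One complex instruction: 1 instruction, 4 uops.
complex_case = decode_cycle([4])

# Four simple instructions: 4 instructions, 4 uops.
simple_case = decode_cycle([1, 1, 1, 1])

# Four slots, one of them a fused pair: 5 instructions, still 4 uops.
fused_case = decode_cycle([1, 1, 1, 1], fused_pairs=1)

for name, d in [("complex", complex_case),
                ("simple", simple_case),
                ("fused", fused_case)]:
    print(f"{name}: {d.instructions} instructions -> {d.uops} uops")
```

All three cases emit 4 uops, yet they consume 1, 4 and 5 instructions respectively, which is why the uop count alone doesn't tell you the decode width.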
>
> I am not following Skylake details because I am awaiting for a more detailed description of the
> arch. But "5-wide decode" for Skylake sounds that can supply up to 5 fused-uops per cycle.
In the Intel manual they were careful to say that it could emit 5 uops per cycle, up from 4. They did not say "5-wide decode" (unless I missed it), which would mean something else.