By: Stubabe (nospam.delete@this.nospam.com), November 28, 2012 4:36 pm
Room: Moderated Discussions
EduardoS (no.delete@this.spam.com) on November 25, 2012 8:15 am wrote:
> Stubabe (nospam.delete@this.nospam.com) on November 25, 2012 3:08 am wrote:
> > There are only 4x Integer ALU for scalar (GPR) code, for SSE/AVX it's still only 3x issue.
> > Also I am not sure if that's 3 SSE ALU or its still split into 2xALU + 1x Shift/Mul/Logical
> > as in current designs - one IDF slide suggested it is still 2xALU another 3x...
>
> No code uses only SSE/AVX; the old 8086 instructions are still used (loop control,
> address calculations, etc.) and the fourth ALU fits nicely for them.
>
True. But, for an extreme example, I have a nice fast (
> > I would say that there is the possibility of this in current designs where you are Fetch bandwidth
> > limited and your code does not use the uop cache (or it's not present), not to mention the reduction
> > of 1 clock latency. In some of my own code 3 operand Integer AVX is faster than SSE2/3/4 with MOVs.
>
> But no 3 operand for old instructions, so there may still be a lot of movs.
>
Laurent Birtz was referring to SSE/AVX and 3-operand instructions, so I was talking specifically about that. In the code I have looked at where I could avoid MOVs by using AVX, I saw improvements, so MOVs are clearly not always free pre-Haswell, but obviously they can't all be eliminated.
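
To make the MOV point concrete, here is a rough sketch in Python (not from my actual code) that lowers the same three operations to destructive 2-operand SSE form and to 3-operand VEX form, then counts instructions. The register assignment and the helper names `lower_sse` and `lower_avx` are made up for illustration, and the model only counts instructions; it says nothing about real latency, fetch bandwidth or port pressure.

```python
# Toy model: counts instructions only, not latency or throughput.
# Register choices and function names are hypothetical, for illustration.

COMMUTATIVE = {"paddd", "pand", "por", "pxor"}


def lower_sse(ops):
    """Lower (dst, op, a, b) to 2-operand form, where `op dst, src` means dst = dst op src."""
    out = []
    for dst, op, a, b in ops:
        if dst == a:
            # Destination already holds the first source: no copy needed.
            out.append(f"{op} {dst}, {b}")
        elif dst == b and op in COMMUTATIVE:
            # Swap the sources instead of copying.
            out.append(f"{op} {dst}, {a}")
        elif dst == b:
            raise ValueError("non-commutative op would clobber its second source")
        else:
            # Both sources must survive, so copy the first one before overwriting.
            out.append(f"movdqa {dst}, {a}")
            out.append(f"{op} {dst}, {b}")
    return out


def lower_avx(ops):
    """Lower (dst, op, a, b) to 3-operand VEX form: one instruction per operation."""
    return [f"v{op} {dst}, {a}, {b}" for dst, op, a, b in ops]


# xmm0 and xmm1 are both read by the first two operations.
ops = [
    ("xmm2", "paddd", "xmm0", "xmm1"),
    ("xmm3", "psubd", "xmm0", "xmm1"),
    ("xmm2", "pxor", "xmm2", "xmm3"),
]

sse = lower_sse(ops)
avx = lower_avx(ops)
print("\n".join(sse))
print("\n".join(avx))
```

In this sketch the SSE lowering needs two `movdqa` copies on top of the three arithmetic instructions, while the VEX lowering needs none. The last `pxor` shows the other half of the argument: when the destination is also a source, there is no MOV to remove in the first place.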



