By: anon (anon.delete@this.anon.com), November 15, 2012 5:48 am
Room: Moderated Discussions
Stubabe (nospam.delete@this.nospam.com) on November 15, 2012 5:14 am wrote:
>
> > But on paper that loop should only take 2 clk/loop on sandy due to co-issue of the fused branch,
> > it takes 3 on Sandy (and 2 on Ivy) due to the loop buffer penalty of 1 clk/iteration
>
> Sorry I was thinking of MOVDQA it should be 3 on Sandy. But the loop buffer can issue max 4 uops/clk to the
> renamer + 1 penalty clock so minimum loop time is 2 clocks irrespective of what the backend does with it.
>
>
MOVAPS can only be issued to port 5, so its maximum throughput is 1 per clk. The fused branch (dec+jnz) is also issued to port 5. What penalty are you talking about?
Another example:
loop:
xorps xmm0, xmm0
xorps xmm0, xmm0
xorps xmm0, xmm0
dec ecx
jnz loop
This loop takes only 1 clk per iteration on both Sandy and Ivy. XORPS is also a port-5-only instruction, but thanks to the "zeroing idioms" feature, which definitely works at the renaming stage, these instructions complete without occupying a backend execution port. If the MOV elimination in Ivy were done by the same renaming technique, the MOVAPS example should also take only 1 clk per iteration, like this XORPS example.
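
To make the arithmetic behind that claim explicit, here is a rough back-of-the-envelope model in Python. It is only a sketch: it takes the port assignments and the 4 uops/clk issue width quoted in this thread at face value, it ignores latency and any loop buffer penalty, and the function name and tuple layout are made up for illustration.

```python
import math

# Assumption: the loop buffer feeds at most 4 fused-domain uops per clock
# to the renamer (the figure quoted earlier in this thread).
ISSUE_WIDTH = 4

def min_clocks_per_iteration(uops):
    """Lower bound on clocks per loop iteration.

    uops: list of (name, port, eliminated) tuples, one per fused-domain uop.
    A uop with eliminated=True is handled at rename and needs no port.
    """
    # Front-end bound: every uop must still be issued, eliminated or not.
    issue_bound = math.ceil(len(uops) / ISSUE_WIDTH)

    # Back-end bound: the busiest port can start one uop per clock.
    port_load = {}
    for _name, port, eliminated in uops:
        if not eliminated:
            port_load[port] = port_load.get(port, 0) + 1
    port_bound = max(port_load.values(), default=0)

    return max(issue_bound, port_bound)

# The XORPS loop above: three zeroing idioms plus one fused dec+jnz on port 5.
xorps_loop = [("xorps", 5, True)] * 3 + [("dec+jnz", 5, False)]

# Hypothetical variant: the same three port-5 instructions, not eliminated.
executed_loop = [("xorps", 5, False)] * 3 + [("dec+jnz", 5, False)]

print(min_clocks_per_iteration(xorps_loop))     # 1
print(min_clocks_per_iteration(executed_loop))  # 4
```

With elimination at rename, only the fused branch competes for port 5 and the bound is 1 clk, which matches the measurement above. Without it, port 5 alone would force 4 clk for the same instruction count.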



