By: Felid (Felid.delete@this.mailinator.com), November 15, 2012 3:19 pm
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on November 15, 2012 5:48 am wrote:
> MOVAPS can only be issued to port 5 (maximum throughput is 1 / clk). Fused branch
> (dec+jnz) is also issued to port 5. What is the penalty you are talking about?
>
> Another example:
>
> loop:
> xorps xmm0, xmm0
> xorps xmm0, xmm0
> xorps xmm0, xmm0
> dec ecx
> jnz loop
>
> This loop takes only 1 clk on both Sandy and Ivy. XORPS is also port5-only instruction, but
> due to the "zeroing idioms" feature, which definitely uses the renaming technique, they are
> executed without using the backend. If the MOV elimination in Ivy were done by the renaming
> technique, the MOVAPS example should take only 1 clk per loop like this XORPS example.
Try replacing MOVAPS #2 with «xmm1, xmm2» (one-way dependence), and then with «xmm2, xmm3» (no dependence). To rule out a possible port-issue bottleneck, it is also worth testing with GPRs, but not with 8- or 16-bit ones :) This will give more information about how the logic works.
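
For concreteness, here is a rough sketch of the substitutions I mean. The original MOVAPS loop is not quoted in this post, so only the second instruction is shown; the surrounding instructions are left as they were. The GPR operands (eax, ebx) are my own pick for illustration, any 32- or 64-bit pair should do.

```
; Variant A: MOVAPS #2 with a one-way dependence
        movaps  xmm1, xmm2

; Variant B: MOVAPS #2 with no dependence
        movaps  xmm2, xmm3

; GPR form of the same test (hypothetical registers).
; Use full 32-bit (or 64-bit) registers, not al/ax: a partial-register
; write presumably drags in a dependence on the old destination value
; and would muddy the result.
        mov     eax, ebx
```

Time each variant with the same dec/jnz loop around it and compare clocks per iteration against the original.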



