By: anon (anon.delete@this.anon.com), November 15, 2012 1:23 am
Room: Moderated Discussions
Felid (Felid.delete@this.mailinator.com) on November 15, 2012 12:49 am wrote:
> > Bulldozer actually eliminates MOVs (for SIMD only) using the register renaming technique as
> > you described. But on Ivy Bridge, as far as I've measured on the actual processor, it appears
> > to fuse a MOV instruction with a subsequent dependent instruction for MOV
> > elimination (when there is no MOV-dependent instruction, the MOV is not eliminated at all).
> >
> > Fusion seems to be done in uop domain because non-adjacent instructions can be fused.
>
> It doesn't make sense. There can be many reads of the mov's destination, so every one of these
> mops should get its source register replaced with (a link to) the original. This can't be done
> with fusion (2 instructions -> 1 mop), but it fits renaming logic perfectly.
I don't mean that it is macro-fusion.
For example,
loop:
movaps xmm1, xmm0
movaps xmm0, xmm1
dec ecx
jnz loop
This loop takes 3 clk/loop on Sandy Bridge and 2 clk/loop on Ivy Bridge. If MOV elimination were done entirely by the renaming logic, the loop should take only 1 cycle on Ivy Bridge, because only dec+jnz would be issued to port 5. It actually takes 2 cycles, which means at least one movaps is still issued to port 5 per loop.
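The arithmetic behind that argument can be sketched as a tiny port-5 model. This is only a rough illustration, not a measurement: it assumes port 5 accepts one uop per cycle, that port 5 is the only bottleneck in this loop, and that dec+jnz occupies a single port-5 slot per iteration, as the reasoning above implies. The function and constant names are hypothetical.

```python
# Back-of-the-envelope port-5 model for the movaps loop.
# Assumptions (not measurements): port 5 accepts one uop per cycle, it is the
# only bottleneck, and dec+jnz occupies a single port-5 slot per iteration.

BRANCH_UOPS = 1   # dec+jnz, counted as one port-5 uop
MOVS_IN_LOOP = 2  # the two movaps instructions in the loop body


def cycles_per_iteration(movs_executed: int) -> int:
    """Cycles per iteration if `movs_executed` movaps uops reach port 5."""
    return movs_executed + BRANCH_UOPS


def movs_executed_from(measured_cycles: int) -> int:
    """Invert the model: movaps uops implied by a measured cycle count."""
    return measured_cycles - BRANCH_UOPS


if __name__ == "__main__":
    print(cycles_per_iteration(MOVS_IN_LOOP))  # neither movaps eliminated
    print(cycles_per_iteration(0))             # both eliminated at rename
    print(movs_executed_from(2))               # 2 clk/loop figure for Ivy Bridge
```

Under these assumptions, the 3 clk/loop figure corresponds to both movaps executing and the 2 clk/loop figure to one movaps still executing. The model ignores the loop-carried dependency between the two moves, so it only bounds the result and does not explain why one MOV escapes elimination.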



