By: Stuart (nospam.delete@this.nospam.com), November 15, 2012 5:04 am
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on November 15, 2012 1:23 am wrote:
> Felid (Felid.delete@this.mailinator.com) on November 15, 2012 12:49 am wrote:
> > > Bulldozer actually eliminates MOVs (for SIMD only) using the register renaming technique as
> > > you described. But in Ivy Bridge, as far as I've measured on the actual processor, it appears
> > > to fuse a MOV instruction with a subsequent dependent instruction for MOV elimination
> > > (when there is no MOV-dependent instruction, the MOV is not eliminated at all).
> > >
> > > Fusion seems to be done in the uop domain, because non-adjacent instructions can be fused.
> >
> > It doesn't make sense. There can be many reads of the mov's destination, so every one of these
> > mops should get its source register replaced with (a link to) the original. This can't be done
> > with fusion (2 instructions -> 1 mop), but it fits renaming logic perfectly.
>
> I don't mean it is macro-fusion.
>
> For example,
>
> loop:
> movaps xmm1, xmm0
> movaps xmm0, xmm1
> dec ecx
> jnz loop
>
> This loop takes 3 clk/loop on Sandy Bridge and 2 clk/loop on Ivy Bridge. If MOV elimination were done
> entirely by the renaming logic, this loop should take only 1 cycle on Ivy (only dec+jnz is issued to
> port 5). But it actually takes 2 cycles, which means at least one movaps is issued to port 5 per loop.
But on paper that loop should take only 2 clk/loop on Sandy Bridge, thanks to co-issue of the fused branch. It takes 3 on Sandy (and 2 on Ivy) because of the loop buffer penalty of 1 clk/iteration.
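
To make the arithmetic behind the two readings explicit, here is a rough Python sketch. It is a toy model, not a simulator: it assumes one uop per clock on the bottleneck port and a fixed per-iteration penalty. The helper name clocks_per_iteration is hypothetical, and the "on-paper 1 clk" figure for Ivy Bridge is inferred from the 2 clk measurement minus the 1 clk penalty rather than stated outright.

```python
# Toy model: clocks per loop iteration = clocks needed on the bottleneck
# port (one uop per clock assumed) plus any fixed per-iteration penalty.
# All figures come from the posts above or are labeled assumptions.

def clocks_per_iteration(bottleneck_clocks, penalty=0):
    """Return the modeled clocks per iteration (hypothetical helper)."""
    return bottleneck_clocks + penalty

scenarios = {
    # Port-5 reading: movaps uops and the fused dec+jnz all compete for port 5.
    "SNB, 2 movaps + dec/jnz on port 5": clocks_per_iteration(3),
    "IVB, both movaps eliminated at rename": clocks_per_iteration(1),
    "IVB, one movaps still issued to port 5": clocks_per_iteration(2),
    # Loop-buffer reading: on-paper figure plus an assumed 1 clk/iteration penalty.
    "SNB, on-paper 2 clk + loop buffer penalty": clocks_per_iteration(2, penalty=1),
    # Assumption: on-paper 1 clk on Ivy, inferred from 2 measured minus 1 penalty.
    "IVB, on-paper 1 clk + loop buffer penalty": clocks_per_iteration(1, penalty=1),
}

for name, clocks in scenarios.items():
    print(f"{name}: {clocks} clk/iteration")
```

In this toy model both readings land on the quoted 3 clk (Sandy) and 2 clk (Ivy), so those two measurements alone do not separate them.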