By: Felid (Felid.delete@this.mailinator.com), November 15, 2012 2:50 pm
Room: Moderated Discussions
Stubabe (nospam.delete@this.nospam.com) on November 15, 2012 5:14 am wrote:
>
> > But on paper that loop should only take 2 clk/loop on sandy due to co-issue of the fused branch,
> > it takes 3 on Sandy (and 2 on Ivy) due to the loop buffer penalty of 1 clk/iteration
>
> Sorry I was thinking of MOVDQA it should be 3 on Sandy. But the loop buffer can issue max 4 uops/clk to the
> renamer + 1 penalty clock so minimum loop time is 2 clocks irrespective of what the backend does with it.
SB's IDQ (instruction decode queue, its official name) is slightly enhanced: the jump penalty in loop mode (when the LSD logic is active) is 0 clk, not 1 clk as on Nehalem. So it can read uops across iterations, e.g. dec + jnz + mov + mov in a single clk. However, the bottleneck here is the BTB: the shortest time to "predict" a target address (even for jmp and call) is 2 clk.
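
To put numbers on the two views, here is a rough sketch of both lower bounds in Python. It is a toy model and not a simulator: the function names are made up for illustration, the 4 uops/clk width and the 1 clk penalty are the figures from the quoted post, the 2 clk BTB figure is the one stated above, and counting the loop as 3 uops assumes dec + jnz macro-fuse into one.

```python
import math

ISSUE_WIDTH = 4   # uops/clk from the loop buffer to the renamer (quoted figure)
BTB_LATENCY = 2   # clk to predict a taken-branch target (figure stated above)

def bound_with_jump_penalty(uops, penalty=1):
    """Quoted model: whole clocks to issue one iteration, plus a
    fixed penalty clock at the loop-back jump."""
    return math.ceil(uops / ISSUE_WIDTH) + penalty

def bound_with_btb_limit(uops):
    """Model described above: no jump penalty, so uops can be read
    across iterations (the issue bound may be fractional), but one
    taken branch per iteration cannot go faster than the BTB."""
    return max(uops / ISSUE_WIDTH, BTB_LATENCY)

# Assumption: dec+jnz (fused) + mov + mov = 3 uops per iteration.
for n in (3, 12):
    print(n, bound_with_jump_penalty(n), bound_with_btb_limit(n))
```

For the 3-uop loop both models land on the same 2 clk floor, so that loop alone cannot tell them apart. They only diverge for longer loop bodies, where the penalty model keeps adding a clock and the BTB model does not.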



