By: anon (spam.delete.delete@this.this.spam.com), August 4, 2018 5:05 am
Room: Moderated Discussions
foobar (foobar.delete@this.foobar.foobar) on August 4, 2018 1:40 am wrote:
> Travis (travis.downs.delete@this.gmail.com) on August 3, 2018 1:34 pm wrote:
> > What I learned yesterday that the distinction between simple and complex addressing apparently
> > happens dynamically after decode. In particular, something like [rdx + rsi*4] looks like
> > complex addressing, but if rsi is zero at runtime and was set to zero by a zeroing idiom
> > the latency is as-if you were using simple addressing (4 cycles for GP loads).
>
> This made me wonder: would the same be possible for conditional branches? That is, if you would use
> the zero idiom on a register and perform a macro-fused compare-and-branch operation not involving
> other registers on it, the executed uop would actually be an unconditional jump, or even a nop if
> the conditional branch would not be taken? I guess this would be visible mostly through the branch
> predictor since predicted taken branches have the same performance as unconditional jumps...
It would be difficult to implement, since the zeroing idiom is only recognized at the rename stage, or at best at the decoders (if the instructions are adjacent), whereas the fetch stage, and therefore the branch predictor, runs ahead of even the decoders.
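A rough way to picture the ordering problem, as a Python sketch. The four-stage list is a heavy simplification I am assuming for illustration (real pipelines have many more stages), and `stages_too_late` is a made-up helper name.

```python
# Simplified front-end order (an assumption for illustration only).
PIPELINE = ["fetch", "decode", "rename", "execute"]


def stages_too_late(consumer: str, producer: str) -> int:
    """Number of stages after `consumer` at which `producer` info appears.

    A positive result means the consumer has already acted before the
    information exists; zero or negative means it arrives in time.
    """
    return PIPELINE.index(producer) - PIPELINE.index(consumer)


# The branch predictor works alongside fetch.
print(stages_too_late("fetch", "rename"))  # zeroing idiom seen at rename: 2
print(stages_too_late("fetch", "decode"))  # best case, adjacent instructions: 1
```

Either way the result is positive, so the predictor has already made its guess by the time anything downstream could know the register is zero.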
The zero idiom would disappear anyway, and a branch still has to execute to preserve program behavior. In theory you could save a uop in the unconditional not-taken case, but unless that always happens (and then why is the branch even in the program?) some check still needs to be performed, and you'd rather not move that into the rename or decode stage.
If you've already got macro-fusion, there is no real improvement in the always-taken case. Whether it is a fused uop that compares against the zero register or an unfused uop for an unconditional jump shouldn't matter.
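To put rough numbers on the last two paragraphs, here is a toy uop count in Python. It is only a sketch under stated assumptions: the zeroing idiom is eliminated at rename and costs no execution uop, test+jcc macro-fuse into one uop, and the hypothetical folded not-taken branch becomes a nop that needs no execution port. The function name and its parameters are made up for illustration.

```python
def executed_uops(seq, fold_branch=False, taken=True):
    """Count uops that need an execution port for a short sequence.

    seq         -- list of "zero_idiom", "test", "jcc" or any other op name
    fold_branch -- model the hypothetical rename-time branch folding
    taken       -- direction of the branch on the known-zero register
    """
    uops = 0
    zero_known = False  # set once a zeroing idiom has been seen
    i = 0
    while i < len(seq):
        op = seq[i]
        fused = op == "test" and i + 1 < len(seq) and seq[i + 1] == "jcc"
        if op == "zero_idiom":
            zero_known = True  # assumed eliminated at rename: no uop
            i += 1
        elif fused:
            if fold_branch and zero_known:
                uops += 1 if taken else 0  # jmp if taken, nop otherwise
            else:
                uops += 1  # one macro-fused compare-and-branch uop
            i += 2
        else:
            uops += 1
            i += 1
    return uops


# e.g. xor eax,eax / test eax,eax / jz or jnz
seq = ["zero_idiom", "test", "jcc"]
print(executed_uops(seq))                                 # today: 1
print(executed_uops(seq, fold_branch=True, taken=True))   # folded, taken: 1
print(executed_uops(seq, fold_branch=True, taken=False))  # folded, not taken: 0
```

Under these assumptions the always-taken case stays at one uop with or without folding, and only the always-not-taken case drops to zero.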
So theoretically possible, but practically useless because you still need the branch predictor anyway.