By: foobar (foobar.delete@this.foobar.foobar), August 4, 2018 9:48 am
Room: Moderated Discussions
anon (spam.delete.delete@this.this.spam.com) on August 4, 2018 8:32 am wrote:
> foobar (foobar.delete@this.foobar.foobar) on August 4, 2018 7:00 am wrote:
> > anon (spam.delete.delete@this.this.spam.com) on August 4, 2018 5:05 am wrote:
> > > foobar (foobar.delete@this.foobar.foobar) on August 4, 2018 1:40 am wrote:
> > > > Travis (travis.downs.delete@this.gmail.com) on August 3, 2018 1:34 pm wrote:
> > > > > What I learned yesterday that the distinction between simple and complex addressing apparently
> > > > > happens dynamically after decode. In particular, something like [rdx + rsi*4] looks like
> > > > > complex addressing, but if rsi is zero at runtime and was set to zero by a zeroing idiom
> > > > > the latency is as-if you were using simple addressing (4 cycles for GP loads).
> > > >
> > > > This made me wonder: would the same be possible for conditional branches? That is, if you would use
> > > > the zero idiom on a register and perform a macro-fused compare-and-branch operation not involving
> > > > other registers on it, the executed uop would actually be an unconditional jump, or even a nop if
> > > > the conditional branch would not be taken? I guess this would be visible mostly through the branch
> > > > predictor since predicted taken branches have the same performance as unconditional jumps...
> > >
> > > It would be difficult to implement since the zeroing idiom will only be recognized at the
> > > rename stage and at best at the decoders (if the instructions were adjacent) whereas the
> > > fetch stage and therefore the branch predictor are running ahead of even the decoders.
> > >
> > > The zero idiom would disappear anyway and a branch has to
> > > execute to preserve the program. In theory you could
> > > save a uop in the unconditional not-taken case but unless
> > > that always happens (why is that even in the program?)
> > > some check still needs to be performed and you'd rather not move that into the rename or decode stage.
> > > If you've already got macro-fusion there is no real improvement in the always taken case. Fused uop
> > > that compares with the zero register or unfused uop for an unconditional jump shouldn't matter.
> > >
> > > So theoretically possible, but practically useless because you still need the branch predictor anyway.
> >
> > I admit this would be a pretty esoteric case. Nonetheless, there's a theoretical scenario: a function
> > which is littered with conditional blocks executed or not
> > executed according to a flag value. If the theoretical
> > construct would be implemented on the CPU, one could avoid both use and "pollution" of the branch predictor
> > in many cases by zeroing the flag using the zero idiom ahead of calling this code. Mispredictions wouldn't
> > occur on branches related to this flag, and probably all predicted variants would strongly predict the
> > non-zero condition. (Of course, this wouldn't be very practical in the sense there is no "one idiom" for
> > traditional Intel registers, which would make this behaviour quite asymmetric.)
> >
>
> You'd still need an entry for unconditional jumps in the branch predictor or
> fetch won't work. I just don't see a way how you could get around that.
> Also what happens if the flag changes? If it never does the code could be simplified
> and if it does a branch predictor that recognizes the correlation would perform
> well enough without any added logic that notifies it of the flag status.
I've seen plenty of C functions, thousands of lines long, where lots of special cases are handled repeatedly depending on a very small set of flags which could be known long before calling the function, and which don't change their value in the middle. These functions are called repeatedly for different inputs, and specializing them for different inputs probably wouldn't do the instruction cache hit rate any good.
But yes, I understand it is unlikely that this would be easy to implement in hardware, or that it would be worthwhile in general. The question just popped into my mind because it's pretty surprising that the method of address calculation seems to depend on whether an input register originates from a zero idiom...
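
To make the kind of function I mean a bit more concrete, here is a minimal sketch in C. Everything in it is hypothetical: the flag names, the `process` function and the per-element steps are made up for illustration, and the real functions are thousands of lines long rather than a single loop. The point is only that `flags` is fixed for the whole call, yet every test on it is a separate conditional branch.

```c
#include <stddef.h>

/* Hypothetical flag bits; the names are illustrative only. */
enum {
    FLAG_NEGATE = 1 << 0,
    FLAG_DOUBLE = 1 << 1,
    FLAG_CLAMP  = 1 << 2
};

/* Sums the input, applying optional per-element steps selected by flags.
 * The flags never change during the call, but each test below is still
 * a conditional branch that the branch predictor has to track. */
long process(const int *data, size_t n, unsigned flags)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        long v = data[i];
        if (flags & FLAG_NEGATE)        /* special case 1 */
            v = -v;
        if (flags & FLAG_DOUBLE)        /* special case 2 */
            v *= 2;
        if (flags & FLAG_CLAMP) {       /* special case 3 */
            if (v > 100)
                v = 100;
            if (v < -100)
                v = -100;
        }
        sum += v;
    }
    return sum;
}
```

A caller would use it as, say, `process(buf, n, FLAG_DOUBLE | FLAG_CLAMP)`. In a toy like this a compiler might well hoist the flag tests out of the loop on its own; in the large functions I have in mind, the tests are scattered all over the body and that kind of transformation is not realistic.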