TIL: simple vs complex addressing is resolved at rename time (probably)

By: anon (spam.delete.delete@this.this.spam.com), August 4, 2018 8:32 am
Room: Moderated Discussions
foobar (foobar.delete@this.foobar.foobar) on August 4, 2018 7:00 am wrote:
> anon (spam.delete.delete@this.this.spam.com) on August 4, 2018 5:05 am wrote:
> > foobar (foobar.delete@this.foobar.foobar) on August 4, 2018 1:40 am wrote:
> > > Travis (travis.downs.delete@this.gmail.com) on August 3, 2018 1:34 pm wrote:
> > > > What I learned yesterday is that the distinction between simple and complex addressing apparently
> > > > happens dynamically after decode. In particular, something like [rdx + rsi*4] looks like
> > > > complex addressing, but if rsi is zero at runtime and was set to zero by a zeroing idiom,
> > > > the latency is as if you were using simple addressing (4 cycles for GP loads).
> > >
> > > This made me wonder: would the same be possible for conditional branches? That is, if you used
> > > the zero idiom on a register and performed a macro-fused compare-and-branch operation on it that
> > > involves no other registers, would the executed uop actually be an unconditional jump, or even a nop
> > > if the conditional branch would not be taken? I guess this would be visible mostly through the branch
> > > predictor, since predicted-taken branches have the same performance as unconditional jumps...
> >
> > It would be difficult to implement since the zeroing idiom will only be recognized at the
> > rename stage and at best at the decoders (if the instructions were adjacent) whereas the
> > fetch stage and therefore the branch predictor are running ahead of even the decoders.
> >
> > The zero idiom would disappear anyway and a branch has to
> > execute to preserve the program. In theory you could
> > save a uop in the unconditional not-taken case but unless
> > that always happens (why is that even in the program?)
> > some check still needs to be performed and you'd rather not move that into the rename or decode stage.
> > If you've already got macro-fusion, there is no real improvement in the always-taken case. Whether it's a
> > fused uop that compares with the zero register or an unfused uop for an unconditional jump shouldn't matter.
> >
> > So theoretically possible, but practically useless because you still need the branch predictor anyway.
>
> I admit this would be a pretty esoteric case. Nonetheless, there's a theoretical scenario: a function
> which is littered with conditional blocks executed or not executed according to a flag value. If the theoretical
> construct were implemented on the CPU, one could avoid both use and "pollution" of the branch predictor
> in many cases by zeroing the flag using the zero idiom ahead of calling this code. Mispredictions wouldn't
> occur on branches related to this flag, and probably all predicted variants would strongly predict the
> non-zero condition. (Of course, this wouldn't be very practical in the sense there is no "one idiom" for
> traditional Intel registers, which would make this behaviour quite asymmetric.)
>

You'd still need an entry for unconditional jumps in the branch predictor or fetch won't work. I just don't see how you could get around that.
Also, what happens if the flag changes? If it never does, the code could be simplified, and if it does, a branch predictor that recognizes the correlation would perform well enough without any added logic that notifies it of the flag status.

Branching often on a single flag would also be rare.
I can see the use for it with predication. Maybe we'll see an optimization for AVX-512 where an op with zeroed mask turns into a nop.

> Yeah, it's a wild, and probably a stupid idea. Probably quite nonsensical and nontrivial to implement. Another
> similarly baffling idea is to macro-fuse an arithmetic operation with a conditional branch which is always
> taken (thanks to reasoning on invariants of the operation's flag results). This would allow performing four
> ALU ops in a cycle *and* an unconditional jump. There the "why" question would of course be related to the
> fact that typical hot code probably wouldn't have that unconditional jump in the first place...

Yeah, it would either be turned into straight-line code, or you'd be touching the flag multiple times (e.g. a counter), and figuring out which instruction will actually change the flag becomes a lot more difficult.