TIL: simple vs complex addressing is resolved at rename time (probably)

By: Travis (travis.downs.delete@this.gmail.com), August 3, 2018 1:34 pm
Room: Moderated Discussions
Actually I learned it yesterday, but I'm posting it today.

The last decade of Intel chips have made a distinction between simple and complex addressing. Simple addressing is anything like [reg + offset] where offset is in [0, 2047]. That is, non-indexed addressing with a small non-negative offset.

Complex addressing is everything else, and it gets a one-cycle latency penalty. So, for example, the minimum L1 load-to-load latency is 4 cycles, but that is for simple addressing; complex addressing takes 5 cycles.
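
To make that concrete, here is a rough (not exhaustive) illustration of which forms fall on which side:


mov rax, [rdi]          ; simple: base only
mov rax, [rdi + 16]     ; simple: base + offset in [0, 2047]
mov rax, [rdi + rsi*4]  ; complex: uses an index register
mov rax, [rdi + 4096]   ; complex: offset outside [0, 2047]
mov rax, [rdi - 8]      ; complex: negative offset
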

Nothing new there, just background.

What I learned yesterday is that the distinction between simple and complex addressing apparently happens dynamically, after decode. In particular, something like [rdx + rsi*4] looks like complex addressing, but if rsi is zero at runtime and was set to zero by a zeroing idiom, the latency is as if you were using simple addressing (4 cycles for GP loads).

To be concrete, the following pointer-chase runs in 5 cycles per iteration:


mov esi, 0
chase:
mov rdi, [rdi + rsi*4]
cmp ... ; some exit condition
jcc chase


... but the following pointer-chase runs in 4 cycles per iteration, where the whole difference is how rsi is zeroed before the loop:


xor esi, esi
chase:
mov rdi, [rdi + rsi*4]
cmp ... ; some exit condition
jcc chase


So apparently the distinction between simple and complex happens (probably) at rename time, which notices that the index register points to the zero register and sends the load into the scheduler as a latency-4 instruction.

There are many instructions that already have different latencies (and uop counts) depending on some value, but the existing examples that I know of are all based on immediate values encoded in the instruction, so they can be sorted out by the decoders. Examples include adc with a 0 immediate (1 uop vs 2 on SnB through Haswell), certain shift instructions with a 0 or 1 immediate, etc.
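
For example, the adc special case mentioned above looks like this (the uop counts are the ones I remember for SnB through Haswell, so treat them as approximate):


adc eax, 0   ; 0 immediate: 1 uop on SnB through Haswell
adc eax, 7   ; any other immediate: 2 uops on those chips
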

Is this useful information in practice for optimization?

Probably not, or only very rarely. In principle, with a dynamic index value you could do an explicit check of the index value before entering a loop where the latency matters, and explicitly zero out the register (an architectural no-op, but not a uarch no-op), like:


test rsi, rsi
jnz skip
xor esi, esi ; pointless in principle but saves a cycle of latency on Intel
skip:
...


First of all, chained pointer chases are more likely to use a simple base + offset addressing mode and not use indexed modes at all. When you are using indexed modes, this trick only works if the pointer chase is updating the base register but not the index register. Finally, it only works if the index register being zero is a common case.

One could imagine a scenario where this is true: some generic list-traversal code where you pass in the offset of the field that contains the "next" pointer. That will probably generate an indexed load with most compilers, something like the sketch below. In that case, the xor hack would help. Of course, you could have just specialized the code with an explicit check for zero up front and dispatched to a simpler loop in that case, to get the same benefit (and it would be faster on other platforms too).
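
Roughly what I have in mind (a hand-written sketch rather than actual compiler output), with the runtime "next" offset in rsi and the current node in rdi:


walk:
mov rdi, [rdi + rsi]   ; indexed load of the "next" pointer at a runtime offset
test rdi, rdi          ; stop at a null pointer
jnz walk


If the "next" field happens to live at offset zero, the xor trick before the loop would shave a cycle off every step of the chase.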

So it's more a curiosity than anything, I think.

Finally, this is a source of mysterious "mode shifts" in performance, where some loop runs at one speed and then suddenly shifts to a different speed without any control-flow change in the running application at all. What happens is that any interrupt will save and restore the registers, but the restore won't use a zeroing idiom to restore the zero, of course: it just uses a load from memory or XRSTOR or whatever, so suddenly your loop slows down because the renamed register no longer points at the zero register.

Similarly, if you make an occasional function call and you are using a callee-saved register as the index in the caller: the callee will save and restore it, but using push/pop or similar, which breaks the mapping to the zero register, and then the caller's code slows down.
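
As a contrived sketch of that last case (with rbx standing in for the callee-saved index register):


caller_loop:
mov rdi, [rdi + rbx]  ; rbx was zeroed with xor ebx, ebx, so this behaves as simple
call helper           ; the occasional call
...

helper:
push rbx              ; preserves the architectural value zero...
...
pop rbx               ; ...but the pop is an ordinary load, not a zeroing idiom,
ret                   ; so rbx no longer maps to the zero register


After the call returns, the indexed loads in the caller are back to the 5-cycle complex-addressing latency even though rbx is still architecturally zero.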
