Data-dependent instruction latency

By: Travis, August 4, 2018 3:33 pm
Room: Moderated Discussions
Peter E. Fry on August 4, 2018 7:14 am wrote:
> Going back a few years, I had a test case on the AMD K8/K10 where MUL had three distinct latencies: one factor
> = 0 or 1; one factor = power of 2; everything else. I discovered it via some poorly-formed test data (all
> 0s), which made me think I had some magically fast code. My K10 board is on a shelf in the closet, so I can't
> check my sanity at the moment (yes, it is suspect). I don't have any later AMD chips to test.

That's interesting. I wonder if that could even work efficiently today? My impression is that modern chips need/want to know the latency up front, at schedule time, so that dependent operations can be woken up on exactly the right cycle to read the result off of the bypass network. That makes variable-latency instructions problematic, since you'd either need to wake the dependents up on multiple consecutive cycles (if the range of latencies were small) or wait for the result to be written back to the register file and then wake them, adding latency and power. I think... (corrections welcome).

Modern chips do still have variable-latency ALU ops, such as integer divide on Intel, but these are already long-latency instructions (20+ cycles), often micro-coded, so waking up dependent operations on exactly the right cycle probably isn't important.
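
For anyone who wants to check this on their own hardware, the usual trick is a serial dependency chain: make each multiply consume the previous result, so the loop time tracks latency rather than throughput, then compare different multiplier values. Below is a rough sketch in C, not a calibrated benchmark. The names `mul_chain` and `ns_per_iter` are made up for illustration, the `| 1` is my own addition to keep the running value from collapsing to zero, and `clock()` is assumed to be precise enough for long runs.

```c
#include <stdint.h>
#include <time.h>

/* Serial chain of multiplies: every multiply needs the previous result,
   so loop time is bounded by MUL latency, not MUL throughput.
   The multiplier is re-read through a volatile each iteration so the
   compiler cannot constant-fold or strength-reduce the multiply. */
uint64_t mul_chain(uint64_t multiplier, uint64_t iters)
{
    volatile uint64_t vm = multiplier;
    uint64_t x = 1;
    for (uint64_t i = 0; i < iters; i++) {
        /* The OR keeps x nonzero (assumption: it adds the same fixed
           delay to every case, so differences still come from the MUL). */
        x = (x * vm) | 1;
    }
    return x;
}

/* Average CPU time per chained iteration, in nanoseconds.
   iters must be greater than zero. */
double ns_per_iter(uint64_t multiplier, uint64_t iters)
{
    clock_t c0 = clock();
    volatile uint64_t sink = mul_chain(multiplier, iters);
    clock_t c1 = clock();
    (void)sink;
    return (double)(c1 - c0) / CLOCKS_PER_SEC * 1e9 / (double)iters;
}
```

Call `ns_per_iter` with a large iteration count (hundreds of millions) for multipliers 0, 1, 2 and some arbitrary odd constant, and compare the figures. If they differ, that hints at data-dependent latency; if they match, the multiplier is probably fixed-latency. Converting to cycles needs the actual core clock, so pin the frequency or treat the numbers as relative.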

> Are these sorts of things documented in one place somewhere?

Not sure exactly what the scope of "these sorts of things" is, but if you mean performance quirks that aren't covered by, say, the model you'd get from reading Agner and the Intel and AMD optimization guides, then I've recently started collecting some of them here.

Note that there are a ton of other "quirks", like 4k aliasing, cache associativity, false dependencies on popcnt, branch prediction, cross-thread contention for memory, but many are more or less well covered so I don't include them there. Actually I don't really have a specific guideline for what should go in there: it's partly a list for myself to track some of this stuff.

> Not really related, but I've run clean into the limitations of static analysis (staring at code).

I find static analysis mostly still works for relatively contained examples that don't involve L3 or main memory. That is, there are few gross deviations from what a good static analysis would achieve. When you do find a gross deviation, it's a chance to update the model: sometimes it leads to a new undocumented quirk.

> I have two mysteries (at the moment) (BSF running faster than it should on Haswell; two sets
> of sequences compiled on GCC and Clang with identical instruction counts that run... differently
> than I would expect) - apparently performance counters are the only way these days.

Even performance counters don't help if some instruction is just taking longer and it doesn't show up as an extra uop or whatever.

If your example is non-proprietary and reduced[1], why not post it here? Maybe someone already knows what's going on, and if not it could be a new mystery to solve.

BTW, I expect BSF to run with 3 cycle latency and 1 cycle throughput, with a dependency on the destination register (i.e., the destination is read-write like add, not write-only like mov).

[1] A small example that shows the weird effect without a ton of extraneous stuff.
