Data-dependent instruction latency

By: Travis (travis.downs.delete@this.gmail.com), August 6, 2018 5:10 pm
Room: Moderated Discussions
Peter E. Fry (pfry.delete@this.tailbone.net) on August 6, 2018 7:34 am wrote:
> Travis (travis.downs.delete@this.gmail.com) on August 5, 2018 4:55 pm wrote:
> > Do you have some actual numbers for that you are seeing,
> > such as nanoseconds or clocks per Mask_to_Size call?
>
> I do not, not with this setup. I'm doing most of this on my Windows TV/game machine, and I'm in quick-and-dirty
> mode. I need to get the servers off my desk and a real workstation in their place.

Right, but you mentioned BSF running too fast, so I was wondering what actual numbers you saw. Presumably you ran the code you posted, which does spit out numbers?

>
> [...]
> > You could cut out some instructions with an unroll which
> > would remove some of the loop overhead like add
> > $0x10,%rcx
> > and the cmp/jne. Indeed, when I add -funroll-loops to the command line I get 3.3 cycles/elem.
>
> Interesting! I'll have to try it. Much of the time I'm locked into a serial form due
> to variances in the data, and uop caches make unrolling generally less profitable.

Note that your benchmark is totally parallel, which is why latencies barely matter (except in the false-dep case) and the analysis is all about throughput and hence total uops and port pressure. The benchmark misses in L1 and hits in L2, but this also doesn't matter much because the accesses are parallel.

If your real world case is serial, make your test serial: this is as easy as feeding the output of one call into the next. The feeding doesn't actually have to change the input value: a general approach is to take the output value, AND it with zero, and then ADD it to the input value - adding 2 cycles of latency which you can subtract out of your measurement. If the output value has a known form like 0 or -1 you can make that 1 cycle, or even 0 cycles if the output is directly suitable as input, but since you'll subtract it out it doesn't really matter.
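A minimal C sketch of that serialization trick (mask_to_size here is a hypothetical stand-in for your Mask_to_Size; a volatile zero plays the role of the AND-with-zero at the asm level, since a literal `& 0` would be folded away by the compiler):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the Mask_to_Size call in your code. */
static uint64_t mask_to_size(uint64_t mask) {
    return (uint64_t)__builtin_ctzll(mask);   /* undefined for mask == 0 */
}

/* Serialized benchmark loop: each call's output is ANDed with zero and
   ADDed to the next input.  The value is unchanged, but the dependency
   chain forces the calls to run one after another, adding 2 cycles of
   AND+ADD latency that you subtract from the measurement. */
uint64_t serialized_sum(const uint64_t *masks, size_t n, int iters) {
    volatile uint64_t vzero = 0;  /* opaque zero: keeps the compiler from
                                     constant-folding the AND */
    uint64_t dep = 0, sum = 0;
    for (int it = 0; it < iters; it++) {
        for (size_t i = 0; i < n; i++) {
            /* in == masks[i] at runtime, but it depends on dep */
            uint64_t in = masks[i] + (dep & vzero);
            dep = mask_to_size(in);
            sum += dep;
        }
    }
    return sum;
}
```

Functionally the loop computes the same sums as the parallel version; only the dependency structure (and hence the timing) changes.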


> Good point. I wasn't thinking in those terms for a benchmark, and I tend to think serially anyway. I'd
> figure the memory read and especially write would limit parallelism in a single thread, but now I'll
> have to consider it.

Memory access is very parallel all down the pipe on Intel and AMD. L1 hits are essentially "infinitely parallel", at least on modern Intel: other than the restriction of dispatching 2 per cycle you can more or less have as many in flight at once as you want (the short L1 hit latency limits this to about 14 in flight no matter how hard you try, though).

L1 misses are limited by the number of fill buffers per core, which is 10 on recent Intel, so you can sometimes get close to a factor of 10 speedup over serial access by making sure your accesses are parallel. So it is not at all true that memory is a big parallelism limiter (if you miss in the TLB you do get limited a bit by the 1 or 2 available page walkers, though - but 2MB pages can come to the rescue...).
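The serial-vs-parallel memory distinction can be sketched like this (a rough illustration, not your code: the names and the single-cycle permutation setup are my own):

```c
#include <stdint.h>
#include <stdlib.h>

/* Serial access: a pointer chase.  The address of load i+1 is the value
   loaded at step i, so only one miss can be outstanding and every element
   costs a full round trip to wherever the line lives. */
uint64_t chase(const uint64_t *next, uint64_t start, size_t n) {
    uint64_t idx = start, sum = 0;
    for (size_t i = 0; i < n; i++) {
        idx = next[idx];          /* carried dependency through memory */
        sum += idx;
    }
    return sum;
}

/* Parallel access: addresses come from the loop counter, independent of
   the loaded data, so the core keeps many misses in flight at once
   (bounded by the ~10 fill buffers) and runs at bandwidth, not latency. */
uint64_t stream(const uint64_t *data, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];           /* no carried dependency */
    return sum;
}

/* Build a single-cycle permutation (stride coprime with n) so the chase
   visits every element exactly once; both functions then compute the same
   sum, and only their timing differs. */
int demo(size_t n, size_t stride) {
    uint64_t *next = malloc(n * sizeof *next);
    if (!next) return 0;
    for (size_t i = 0; i < n; i++)
        next[i] = (i + stride) % n;
    int ok = chase(next, 0, n) == stream(next, n);
    free(next);
    return ok;
}
```

Timing the two loops over a buffer larger than L1 is where the factor-of-several gap shows up; the point is that the result is identical either way.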

> Especially (theoretically) on Ryzen, with its single-cycle multi-pipe POPCNT.

Yeah, Ryzen has awesome bit-counting support: popcnt, lzcnt and tzcnt are all way faster than on Intel. It's a shame that pdep and pext are awful though.



> > BTW there are gcc (and clang) builtins for all of the ops you are using inline ASM for [...]
>
> True, for the very simple stuff. I haven't noticed any advantage in optimization (yet; after all, the black boxes
> are very small), but I should whip up intrinsic versions anyway.

Whether it helps or not really depends on the scenario, but it doesn't matter how small the black box is: optimization can reduce the cost of the intrinsic version to zero in many cases.

For example, see this version based on your code. The call_intrin is your code but with intrinsics: note that gcc removes the BSWAP entirely since it can figure out that byte order doesn't change the answer. That can never happen with inline asm.

The next example is when passing a constant 0 as the high mask: now the optimizer is able to remove, in the intrinsic case, all the code associated with the high mask since it can calculate the result at compile time. The inline asm can't do that.

So you should consider the use case carefully: general functions where you don't know all the calling scenarios are often better off with intrinsics, since they optimize much better when something is known about the arguments.
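A tiny sketch of the BSWAP-removal point (illustrative only - popcount stands in for any byte-order-insensitive operation, and the empty asm mimics the opacity of a real inline-asm BSWAP):

```c
#include <stdint.h>

/* Intrinsic version: gcc can prove popcount is invariant under byte
   swapping, so at -O2 the bswap disappears from the generated code. */
uint64_t popcnt_intrin(uint64_t x) {
    return (uint64_t)__builtin_popcountll(__builtin_bswap64(x));
}

/* Black-box version: an empty asm makes the swapped value opaque, the way
   real inline asm is, so the swap can never be optimized away. */
uint64_t popcnt_blackbox(uint64_t x) {
    uint64_t y = __builtin_bswap64(x);
    __asm__ volatile("" : "+r"(y));   /* optimization barrier */
    return (uint64_t)__builtin_popcountll(y);
}
```

Both return identical results; compiling with `gcc -O2 -S` shows the bswap instruction surviving only in the second.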

Note that not all the compilers are equally good at this: clang doesn't do the bswap removal but is otherwise mostly as good as gcc, and icc just sucks here.

> Oh, one of my BSF versions (of the base mask to size) is an intrinsic
> form, and GCC adds an extra instruction to the dependency chain. Can't trust 'em.

Yeah, they often add extra dep-breaking zeroing xors to avoid the false-dependency problem with bsf, tzcnt, lzcnt and popcnt. This is often much faster than omitting them (in fact the false dependency was limiting me to 3 cycles above), but sometimes, when you know better and are avoiding the dep in some other way, it's not good.
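For reference, the builtin in question looks like this (a sketch; the comments describe what gcc typically emits, which you can confirm with `-S` on your target):

```c
#include <stdint.h>

/* __builtin_ctzll compiles to tzcnt (with -mbmi) or bsf.  On Intel cores
   where tzcnt/lzcnt/popcnt carry a false dependency on the old value of
   their destination register, gcc inserts an "xor dst,dst" first to break
   it - the extra instruction in the dependency chain observed above. */
uint64_t trailing_zeros(uint64_t x) {
    /* ctz is undefined for 0 at the C level (bsf leaves the destination
       undefined), so guard it explicitly. */
    return x ? (uint64_t)__builtin_ctzll(x) : 64;
}
```

If you've already broken the dependency yourself (e.g. the destination was freshly written), the compiler's extra xor is pure overhead, which is the case where hand-written asm can still win.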