Data-dependent instruction latency

By: Travis (travis.downs.delete@this.gmail.com), August 5, 2018 4:55 pm
Room: Moderated Discussions
Do you have some actual numbers for what you are seeing, such as nanoseconds or clocks per Mask_to_Size call?

Here are typical values on my system (Skylake client) - I added "per elem" measurements in rdtsc clocks:

gcc 7:
Initialize mask array: 497584
Initialize mask array: 166602
IPv6_Mask_to_Size: 130220 ( 5.0 per elem)
IPv6_Mask_to_Size_BSF_2: 136162 ( 5.3 per elem)
IPv6_Mask_to_Size_Validate: 179214 ( 6.9 per elem)
mask2prefixlen6: 920268 (35.7 per elem)

clang 3.8:

Initialize mask array: 531518
Initialize mask array: 166260
IPv6_Mask_to_Size: 125746 ( 4.9 per elem)
IPv6_Mask_to_Size_BSF_2: 137836 ( 5.3 per elem)
IPv6_Mask_to_Size_Validate: 181888 ( 7.0 per elem)
mask2prefixlen6: 1333692 (51.7 per elem)


If I increase the number of iterations it gets down to 4 cycles per elem. That's about the best you are going to do here without unrolling loops. The core loop for IPv6_Mask_to_Size_2 is:

  2.67 │1c0:   mov    (%rcx),%rdx
  6.00 │       mov    0x8(%rcx),%rax
  7.33 │       add    $0x10,%rcx
 12.00 │       bswap  %rdx
 11.33 │       bswap  %rax
  8.00 │       add    $0x1,%rsi
  7.33 │       xor    %edi,%edi
  7.33 │       xor    %r9d,%r9d
  9.33 │       popcnt %rdx,%rdi
  6.00 │       popcnt %rax,%r9
 10.00 │       add    %r9,%rdi
  4.00 │       mov    %dil,-0x1(%rsi)
       │       cmp    %r8,%rcx
  6.67 │     ↑ jne    1c0


That's 15 uops (BSWAP r64 is 2 uops, and the cmp/jne presumably fuse). So you are getting through 15 (fused-domain) uops in 4 cycles, or 3.75/cycle, which is very close to the max of 4. You probably don't reach 4.00/cycle due to contention for p1, which popcnt needs and which BSWAP and any of the ALU ops can also use.
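
For reference, here is roughly what that loop computes, written with the builtins mentioned at the end of this post (a minimal sketch with my own function and variable names, not the actual benchmark code):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Load each 16-byte mask as two 64-bit halves, byte-swap them (as the
 * generated code does), popcount both halves, and store the sum as the
 * prefix length. For a well-formed mask the bit count is the length. */
static void masks_to_sizes(const unsigned char (*masks)[16],
                           uint8_t *sizes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t hi, lo;
        memcpy(&hi, masks[i], 8);          /* mov (%rcx),%rdx      */
        memcpy(&lo, masks[i] + 8, 8);      /* mov 0x8(%rcx),%rax   */
        hi = __builtin_bswap64(hi);        /* bswap %rdx           */
        lo = __builtin_bswap64(lo);        /* bswap %rax           */
        sizes[i] = (uint8_t)(__builtin_popcountll(hi) +
                             __builtin_popcountll(lo));
    }
}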

You could cut out some instructions with an unroll which would remove some of the loop overhead like add $0x10,%rcx and the cmp/jne. Indeed, when I add -funroll-loops to the command line I get 3.3 cycles/elem.

If you wanted to improve that further, you could eliminate the xor zeroing instructions in the inline asm: these aren't strictly necessary since the subsequent popcnt instructions overwrite those registers anyway. They are there to avoid the false dependency that popcnt has on its destination register, but we are running in "throughput mode" here since the iterations are independent. Indeed, when I strip those out I get down to 3.0 cycles/elem.
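
To illustrate (a sketch with made-up wrapper names, not the actual inline asm from the benchmark): the first variant zeroes the destination to break the false dependency, the second leaves it out, which is fine when the iterations are independent and only throughput matters:

#include <stdint.h>

/* With the dependency-breaking xor (useful when latency matters): */
static inline uint64_t popcnt64_dep_break(uint64_t x)
{
    uint64_t r;
    __asm__("xor %0, %0\n\t"      /* break popcnt's false dep on the dest */
            "popcnt %1, %0"
            : "=&r"(r)            /* early clobber: written before %1 is read */
            : "r"(x));
    return r;
}

/* Without it (saves a uop per call in "throughput mode"): */
static inline uint64_t popcnt64_plain(uint64_t x)
{
    uint64_t r;
    __asm__("popcnt %1, %0" : "=r"(r) : "r"(x));
    return r;
}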

gcc generated unrolled code that used the same destination registers for each popcnt, which means 3.0 cycles is the best you can get (for that codegen), as you are limited by the false-dependency latency of the 3-cycle popcnt.

If you wanted to take the next (and probably final) step, you could organize it so that the unrolled loop uses different destination registers in different iterations, which would cut out the false-dependency impact (e.g., if you use two sets of destination registers, the latency-chain limit drops to 1.5 cycles). Good luck trying to get gcc to do that, though (perhaps you can unroll the test loop by 2 and include fake input params to your asm to convince gcc to keep the values from the first Mask_to_Size call live until the second call, so it has to allocate new registers for the second call). You are not going to get down to 2 cycles though: you just have too many uops: BSWAP(2) * 2 + POPCNT * 2 + 2 * load + store + add, or 10 uops, completely ignoring loop overhead. So 2.5 cycles per elem is probably a limit (a byte-swapping load via MOVBE doesn't help because apparently it doesn't micro-fuse).
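
Something along these lines is the idea (again a hypothetical sketch, not your code): two masks per iteration feeding two independent result chains, so the register allocator is nudged toward giving each popcnt pair its own destination registers and the false dependencies alternate between two registers instead of serializing on one:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

static void masks_to_sizes_x2(const unsigned char (*masks)[16],
                              uint8_t *sizes, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        uint64_t a0, a1, b0, b1;
        memcpy(&a0, masks[i],     8); memcpy(&a1, masks[i]     + 8, 8);
        memcpy(&b0, masks[i + 1], 8); memcpy(&b1, masks[i + 1] + 8, 8);
        a0 = __builtin_bswap64(a0); a1 = __builtin_bswap64(a1);
        b0 = __builtin_bswap64(b0); b1 = __builtin_bswap64(b1);
        /* Two independent popcnt chains per iteration. */
        sizes[i]     = (uint8_t)(__builtin_popcountll(a0) + __builtin_popcountll(a1));
        sizes[i + 1] = (uint8_t)(__builtin_popcountll(b0) + __builtin_popcountll(b1));
    }
    for (; i < n; i++) {          /* odd tail element */
        uint64_t hi, lo;
        memcpy(&hi, masks[i], 8); memcpy(&lo, masks[i] + 8, 8);
        hi = __builtin_bswap64(hi); lo = __builtin_bswap64(lo);
        sizes[i] = (uint8_t)(__builtin_popcountll(hi) + __builtin_popcountll(lo));
    }
}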

Getting rid of the BSWAPs would also help, since each one is 2 uops.

I didn't find any evidence of BSF going "too fast", which is the thing you originally mentioned (and which got me interested).

BTW there are gcc (and clang) builtins for all of the ops you are using inline asm for, e.g., bswap, popcnt, bsf. You'll often get better optimization with builtins since the compiler can "see into them" and do all of the normal optimizations, whereas inline asm is a black box with defined inputs and outputs.
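
For example (a sketch; the wrapper names are mine):

#include <stdint.h>

uint64_t swap64(uint64_t x) { return __builtin_bswap64(x); }        /* BSWAP  */
int pop64(uint64_t x)       { return __builtin_popcountll(x); }     /* POPCNT */
int ctz64(uint64_t x)       { return x ? __builtin_ctzll(x) : 64; } /* BSF/TZCNT; __builtin_ctzll(0) is undefined, hence the guard */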