Data-dependent instruction latency

By: Peter E. Fry (pfry.delete@this.tailbone.net), August 7, 2018 7:09 am
Room: Moderated Discussions
Travis (travis.downs.delete@this.gmail.com) on August 6, 2018 5:10 pm wrote:
> Right but you mentioned BSF running too fast, so I was wondering what actual numbers
> you saw. Presumably you ran the code you posted, which does spit out numbers?

Raw? My, let's see. Here are some older benchmarks: 10 x 2^32 iterations each, register data only (no memory access), times in seconds (yeah, well, they were repeatable):

Sandy:
IPv6 Mask to Size: 49
IPv6 Mask to Size BSF_2: 54
IPv6 Mask to Size BSF_4: 57

Ivy:
IPv6 Mask to Size: 55
IPv6 Mask to Size BSF_2: 52
IPv6 Mask to Size BSF_4: 51

Haswell:
IPv6 Mask to Size: 64
IPv6 Mask to Size BSF_2: 53
IPv6 Mask to Size BSF_4: 53

Using the "mintest" code:

Sandy, GCC:
Initialize mask array: 186276
IPv6_Mask_to_Size: 153615
IPv6_Mask_to_Size_BSF_2: 154286

Sandy, Clang:
Initialize mask array: 177778
IPv6_Mask_to_Size: 144584
IPv6_Mask_to_Size_BSF_2: 143846

Haswell, GCC:
Initialize mask array: 178227
IPv6_Mask_to_Size: 133539
IPv6_Mask_to_Size_BSF_2: 149258

Now that I post 'em, they don't support any assertion other than "I need to figure out what these benchmarks are doing". In my defense, I did point out that my sanity was suspect. (Spoiler, see below: a BSF routine with a bit more C and less assembler would be even faster than the above.) Summary: I see BSF vs. POPCNT results that I can't explain, and go off half-cocked.
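
For context, the general shape of the thing in mostly-C form (not the exact routines behind the numbers above; the names, the two-by-64-bit split and the contiguous-mask assumption are just for illustration):

#include <stdint.h>

/* Prefix length of a contiguous IPv6 netmask, POPCNT flavor. */
static inline unsigned mask_to_size_popcnt(uint64_t hi, uint64_t lo)
{
    return (unsigned)(__builtin_popcountll(hi) + __builtin_popcountll(lo));
}

/* Same thing, BSF/TZCNT flavor: count trailing zeros of the low half,
   falling back to the high half (mask assumed contiguous). */
static inline unsigned mask_to_size_bsf(uint64_t hi, uint64_t lo)
{
    if (lo)
        return 128u - (unsigned)__builtin_ctzll(lo);
    if (hi)
        return 64u - (unsigned)__builtin_ctzll(hi);
    return 0u;
}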

> Note that your benchmark is totally parallel: [...]
> If your real world case is serial, make your test serial: this is as easy as feeding the output
> of one call into the next.

Well, the two main real-world scenarios are singular instances and... well, volume... Hm. I was thinking that arrays would be uncommon, but perhaps not.
As far as challenges to parallelism go, I was thinking mostly of flags (x86) (e.g. the "validate" routine) and variable-length data (e.g. processing IP addresses as text, in or out). Perhaps I should try a real workload rather than optimizing imaginary ones.
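
If I do try the serial route, the quick hack is to fold each result back into the next input, along these lines (a sketch built on the mask_to_size_bsf routine above; the struct and array names are made up, and XORing the result in changes the data, so this measures the latency chain rather than the real workload):

#include <stddef.h>

struct ipv6_mask { uint64_t hi, lo; };   /* stand-in for the real type */

uint64_t bench_serial(const struct ipv6_mask *masks, size_t count)
{
    uint64_t total = 0;
    unsigned carry = 0;
    for (size_t i = 0; i < count; i++) {
        /* Each call depends on the previous result, so calls can't overlap:
           latency, not throughput. */
        unsigned size = mask_to_size_bsf(masks[i].hi ^ carry, masks[i].lo);
        total += size;
        carry = size;
    }
    return total;   /* keep the chain live so the compiler can't drop it */
}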

[...]
> The next example is when passing a constant 0 as the high mask: now the optimizer is
> able to remove, in the intrinsic case, all the code associated with the high mask since
> it can calculate the result at compile time. The inline asm can't do that.

Heh. Not an IP guy? That's not a shot - it's just that you picked the high mask for your example, and if the high mask is 0, the low mask is normally 0 as well. Call it an observation on specializations and contextual nitpicking.
At any rate, there are patterns to the data, but they are not normally resolvable at compile time. (If they were, I'd have a custom assembler routine for it, dammit.)
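
(For anyone following along, the difference Travis means is roughly this - the asm wrapper and call sites are made up for illustration, reusing the popcount sketch above:)

/* Hypothetical inline-asm counterpart: the optimizer treats the asm body as
   a black box, so even with hi == 0 both POPCNTs get emitted. */
static inline unsigned mask_to_size_popcnt_asm(uint64_t hi, uint64_t lo)
{
    uint64_t h, l;
    __asm__ ("popcntq %1, %0" : "=r"(h) : "rm"(hi));
    __asm__ ("popcntq %1, %0" : "=r"(l) : "rm"(lo));
    return (unsigned)(h + l);
}

unsigned demo(uint64_t lo)
{
    unsigned a = mask_to_size_popcnt(0, lo);      /* hi half folds away at compile time */
    unsigned b = mask_to_size_popcnt_asm(0, lo);  /* both halves emitted regardless */
    return a + b;
}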

[...]
> Yeah, they are adding extra dep-breaking zeroing xors often to avoid the false dependency problem with bsf,
> tzcnt, lzcnt and popcnt. This is often much faster than not including them (in fact it was limiting me to 3
> cycles above), but sometimes when you know better and are avoiding the dep in some other way it's not good.
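
(For reference, the idiom in question is a zeroing XOR ahead of the count - a generic sketch, not lifted from my output:)

xorl   %eax, %eax    # zero the destination first: breaks the dependency on
tzcntq %rdi, %rax    # whatever was in %rax, which popcnt/tzcnt/lzcnt falsely
                     # wait on (on the affected cores), and which bsf genuinely
                     # reads, since it leaves the destination unmodified when
                     # the source is zero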

I don't believe the case I saw was a dependency-breaker. In fact... Ha! It pulled a fast one on me, literally:

My sequence in assembler:

subb $128, %b[Size]       # Size -= 128
addb %b[Temp], %b[Size]   # Size += Temp
negb %b[Size]             # Size = 128 - (original Size) - Temp


GCC, plugged into mintest:

Size_u64 = 128 - Size_u64 - Temp_u64;

289 02a3 4889D9 	movq %rbx, %rcx   # rcx = 128 (rbx pre-loaded outside the loop)
290 02a6 4C01F0 	addq %r14, %rax   # rax = Size_u64 + Temp_u64
291 02a9 4829C1 	subq %rax, %rcx   # rcx = 128 - Size_u64 - Temp_u64


(RBX is pre-loaded with 128 outside of the loop.) The GCC sequence compiles poorly as a standalone sequence or function, but is faster inline.
OK, point taken.