By: Peter E. Fry (pfry.delete@this.tailbone.net), August 6, 2018 7:34 am
Room: Moderated Discussions
Travis (travis.downs.delete@this.gmail.com) on August 5, 2018 4:55 pm wrote:
> Do you have some actual numbers for that you are seeing,
> such as nanoseconds or clocks per Mask_to_Size call?
I do not, not with this setup. I'm doing most of this on my Windows TV/game machine, and I'm in quick-and-dirty mode. I need to get the servers off my desk and a real workstation in their place.
[...]
> You could cut out some instructions with an unroll which would remove some of the loop overhead like add
> $0x10,%rcx and the cmp/jne. Indeed, when I add -funroll-loops to the command line I get 3.3 cycles/elem.
Interesting! I'll have to try it. Much of the time I'm locked into a serial form due to variances in the data, and uop caches make unrolling generally less profitable.
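Roughly what I'd expect the straight unroll to look like in C, as a sketch only (mask_to_size_sum and the array layout are placeholders of mine, and it assumes the element count is a multiple of four with no tail handling):

    #include <stdint.h>
    #include <stddef.h>
    #include <nmmintrin.h>   /* _mm_popcnt_u64; build with -mpopcnt or -msse4.2 */

    uint64_t mask_to_size_sum(const uint64_t *masks, size_t n)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i += 4) {   /* add/cmp/jne paid once per 4 elements */
            sum += _mm_popcnt_u64(masks[i + 0]);
            sum += _mm_popcnt_u64(masks[i + 1]);
            sum += _mm_popcnt_u64(masks[i + 2]);
            sum += _mm_popcnt_u64(masks[i + 3]);
        }
        return sum;
    }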
[...]
> If you wanted to take the next (and probably final step) you could organize it so that the unrolled
> loop used different registers for inputs in different iterations [...]
Good point. I wasn't thinking in those terms for a benchmark, and I tend to think serially anyway. I'd figure the memory read and especially write would limit parallelism in a single thread, but now I'll have to consider it. Especially (theoretically) on Ryzen, with its single-cycle multi-pipe POPCNT.
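If I read the register-rotation suggestion right, the C equivalent would be separate accumulators, so the popcounts in one iteration don't all serialize on a single sum. Again just a sketch with made-up names:

    #include <stdint.h>
    #include <stddef.h>
    #include <nmmintrin.h>   /* same headers as the unroll sketch above */

    uint64_t mask_to_size_sum_par(const uint64_t *masks, size_t n)
    {
        uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (size_t i = 0; i < n; i += 4) {
            s0 += _mm_popcnt_u64(masks[i + 0]);   /* four independent chains: the */
            s1 += _mm_popcnt_u64(masks[i + 1]);   /* adds can execute in parallel */
            s2 += _mm_popcnt_u64(masks[i + 2]);
            s3 += _mm_popcnt_u64(masks[i + 3]);
        }
        return (s0 + s1) + (s2 + s3);
    }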
> If you get rid of the BSWAPs it would help.
Haw haw haw! That's being locked into a mode: BSF and the validation routines require normalized data, so I failed to notice that bare POPCNT does not. I suppose I should create explicit big- and little-endian forms to avoid errors and improve portability (not that it'll ever matter).
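Something like this for the explicit forms, I suppose (names, the memcpy-based load, and the direction of the swap are all mine; the point is just that only the bit-position/validation routines need normalizing on little-endian hosts, while POPCNT doesn't care about byte order):

    #include <stdint.h>
    #include <string.h>

    static inline uint64_t load_mask_normalized(const void *p)
    {
        uint64_t v;
        memcpy(&v, p, sizeof v);
    #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        v = __builtin_bswap64(v);     /* normalize for BSF / validation */
    #endif
        return v;
    }

    static inline uint64_t load_mask_native(const void *p)
    {
        uint64_t v;
        memcpy(&v, p, sizeof v);      /* no swap: the bit count is order-independent */
        return v;
    }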
> I didn't find any evidence of BSF going "too fast" which is
> the thing you originally mentioned (and got me interested).
Made you look. No, it's just me - I have a hard time slotting the dependency chain into the multi-cycle latency of BSF; combine that with poor POPCNT results, and it confuses me. It's likely a moot point anyway.
> BTW there are gcc (and clang) builtins for all of the ops you are using inline ASM for [...]
True, for the very simple stuff. I haven't noticed any advantage in optimization (yet; after all, the black boxes are very small), but I should whip up intrinsic versions anyway. Even slightly more complex stuff (like the validation) tends to work better in assembler (that validation routine is just the most compact - I have several that run faster on Haswell+... but likely not on Ryzen). Oh, one of my BSF versions (of the base mask to size) is an intrinsic form, and GCC adds an extra instruction to the dependency chain. Can't trust 'em.
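For reference, the builtin forms are about as simple as it gets. The exact mask-to-size semantics below (including the zero guard) are my placeholders; worth diffing the generated asm against the hand-written version, since as noted the compiler can add an instruction to the dependency chain:

    #include <stdint.h>

    static inline unsigned mask_popcount(uint64_t m)
    {
        return (unsigned)__builtin_popcountll(m);   /* POPCNT with -mpopcnt */
    }

    static inline unsigned mask_first_set(uint64_t m)
    {
        return m ? (unsigned)__builtin_ctzll(m)     /* BSF / TZCNT */
                 : 64u;                             /* __builtin_ctzll(0) is undefined */
    }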
Thanks again. You've given me a lot to think about.