By: Jan Wassenberg (jan.wassenberg.delete@this.gmail.com), June 3, 2022 11:10 pm
Room: Moderated Discussions
Eric Fink (eric.delete.delete@this.this.anon.com) on June 3, 2022 12:23 am wrote:
> In my (admittedly very limited experience) with NEON lack of pmovmskb has been much less painful
> than originally expected because NEON has fast horizontal reduction (min/max/sum).
We do indeed use those to emulate movmskb, e.g. for counting the number of comparison results.
On x86 it's movmskb+popcnt, on Arm it can be negate+addv:
https://github.com/google/highway/blob/master/hwy/ops/arm_neon-inl.h#L5235
The thing is, ADDV is a much more expensive op than mere wiring: 4-cycle latency vs. 1 for movmskb, and that hurts for horizontal (non-batched) hash tables.
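To make the two strategies concrete, here is a portable scalar sketch (plain C, not actual intrinsics) of both counting approaches. On x86 the first loop corresponds to PMOVMSKB followed by POPCNT; on Arm the second corresponds to NEG followed by ADDV, as in the Highway code linked above. The function names and the 16-lane width are illustrative assumptions, not taken from the library.

```c
#include <stdint.h>

/* A SIMD byte compare yields all-ones (0xFF, i.e. -1 as int8) in matching
 * lanes and 0 elsewhere. Both routines below count the matching lanes. */
enum { LANES = 16 };

/* x86 style: pack each lane's MSB into a bitmask (movmskb), then popcnt. */
int count_x86_style(const int8_t lanes[LANES]) {
    unsigned mask = 0;
    for (int i = 0; i < LANES; ++i)
        mask |= (unsigned)((lanes[i] >> 7) & 1) << i;  /* movmskb */
    return __builtin_popcount(mask);                   /* popcnt  */
}

/* Arm style: negate turns -1 lanes into +1, then a horizontal add (ADDV)
 * sums across lanes, yielding the match count directly. */
int count_arm_style(const int8_t lanes[LANES]) {
    int sum = 0;
    for (int i = 0; i < LANES; ++i)
        sum += (int)(int8_t)(-lanes[i]);               /* NEG + ADDV */
    return sum;
}
```

Both return the same count for the same compare result; the difference on real hardware is that ADDV is a multi-cycle reduction tree, whereas movmskb is little more than wiring the MSBs together.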
> and I was able to quickly reformulate all the relevant algorithms, at times achieving significant
> reductions in the number of instructions needed compared to SSE with pmovmskb.
Interesting, why is that?
> But isn't that again an example of optimising for throughput rather than latency?
> Batching hash table lookups is hardly an option for many applications.
Hm, many applications? Why not? I'd venture that the problem is that almost all SW interfaces (including the STL) are limited to single items. To the extent we can break out of that, we are rewarded with 3-5x end-to-end speedups, for example in hash joins, which burn a lot of compute.
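As a sketch of what "breaking out of single-item interfaces" might look like, here is a toy open-addressing table whose lookup takes a whole batch of keys per call. Everything here (names, the multiplicative hash constant, the toy capacity) is a hypothetical illustration, not code from any particular library; the point is only the interface shape, which lets independent probes overlap their memory latency and, in a SIMD version, compare several candidate slots at once.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

#define TABLE_SIZE 64  /* toy power-of-two capacity */

typedef struct {
    uint64_t keys[TABLE_SIZE];
    bool used[TABLE_SIZE];
} Table;

/* Multiplicative hash into a slot index (constant is an arbitrary choice). */
static size_t slot(uint64_t key) {
    return (size_t)(key * 0x9E3779B97F4A7C15ull) & (TABLE_SIZE - 1);
}

void insert(Table *t, uint64_t key) {
    size_t i = slot(key);
    while (t->used[i]) i = (i + 1) & (TABLE_SIZE - 1);  /* linear probe */
    t->keys[i] = key;
    t->used[i] = true;
}

/* Batched lookup: n independent probes per call instead of one. A scalar
 * loop here; a SIMD version could interleave probes and compare lanes. */
void find_batch(const Table *t, const uint64_t *keys, bool *found, size_t n) {
    for (size_t j = 0; j < n; ++j) {
        size_t i = slot(keys[j]);
        found[j] = false;
        while (t->used[i]) {  /* toy: assumes the table is never full */
            if (t->keys[i] == keys[j]) { found[j] = true; break; }
            i = (i + 1) & (TABLE_SIZE - 1);
        }
    }
}
```

A caller doing a hash join would feed a whole block of probe keys to `find_batch` rather than calling a single-key `find` in a loop, which is exactly the interface change that single-item APIs like the STL's rule out.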
> What I primarily have in mind is leveraging SIMD to make certain basic discrete
> operations faster, rather than converting the entire logic to be data-parallel.
Indeed, we get the best results from making everything data-parallel. Is there any particular hindrance besides having to rewrite/redesign?