By: Eric Fink (eric.delete@this.anon.com), June 4, 2022 5:59 am
Room: Moderated Discussions
Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on June 3, 2022 11:10 pm wrote:
> Eric Fink (eric.delete.delete@this.this.anon.com) on June 3, 2022 12:23 am wrote:
> > In my (admittedly very limited experience) with NEON lack of pmovmskb has been much less painful
> > than originally expected because NEON has fast horizontal reduction (min/max/sum).
> We do indeed use those to emulate movmskb e.g. for counting the number of comparison results.
> On x86 it's movmskb+popcnt, on Arm it can be negate+addv:
> https://github.com/google/highway/blob/master/hwy/ops/arm_neon-inl.h#L5235
> Thing is, ADDV is a much more expensive op than just wiring, and it's 4 cycle latency
> vs 1 for movmskb, and that hurts for horizontal (non-batched) hash tables.
I just had a look at the timing tables. On M1, a horizontal reduction is 3 cycles of latency with 4 ops per cycle of throughput; on recent Intel, movmskb is 2 cycles of latency with 1 op per cycle. So the winner seems to depend on what you need to do. I definitely agree that movmskb has little competition if all you need is to detect active lanes.
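To make the two sequences concrete, here is a scalar model of the lane-counting idiom being discussed (not the actual Highway code; `count_matches` is a made-up name, and the loop body stands in for one vector instruction per line):

```c
#include <stdint.h>

/* Scalar model of counting comparison matches across 16 byte lanes.
 * A vector byte-compare produces 0xFF per matching lane, 0x00 otherwise.
 * x86:  movmskb packs the lane MSBs into a scalar mask, popcnt counts them.
 * NEON: negating 0xFF gives 1 per matching lane, and ADDV sums the lanes,
 *       yielding the same count without a movemask step. */
static int count_matches(const uint8_t lanes[16], uint8_t key)
{
    int count = 0;
    for (int i = 0; i < 16; ++i) {
        uint8_t mask = (lanes[i] == key) ? 0xFF : 0x00; /* CMEQ lane result */
        count += (uint8_t)(-mask);                      /* NEG: 0xFF -> 1   */
    }
    return count;                                       /* ADDV: lane sum   */
}
```

The latency numbers above apply to the ADDV step; the compare and negate pipeline at full vector throughput on both architectures.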
> > and I was able to quickly reformulate all the relevant algorithms, at times achieving significant
> > reductions in the number of instructions needed compared to SSE with pmovmskb.
> Interesting, why is that?
Two reasons. First, the geometric tests I am doing usually involve a batch of data-parallel computations followed by picking an extremum of some sort (e.g. intersect a segment against eight lines and find which intersection point is the "furthest"). In such cases a NEON horizontal reduction gives me the result directly and immediately. Second, NEON offers more options when working with NaNs, which simplifies dealing with the edge cases.
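A scalar sketch of the data flow I mean (hypothetical names, not my actual code): the eight per-line intersection parameters sit in vector registers, and a single NaN-ignoring horizontal max (FMAXNMV on AArch64) picks the furthest one, with lines that miss the segment contributing NaN. This covers both points at once, since FMAXNMV treats a quiet NaN operand as "missing" rather than propagating it:

```c
#include <math.h>

/* maxnm semantics (IEEE maxNum): a quiet NaN input is ignored,
 * so a lane whose line/segment test had no valid intersection
 * simply drops out of the reduction. */
static float maxnm(float a, float b)
{
    if (isnan(a)) return b;
    if (isnan(b)) return a;
    return a > b ? a : b;
}

/* Scalar model of FMAXNMV across 8 lanes of intersection parameters. */
static float furthest_t(const float t[8])
{
    float best = t[0];
    for (int i = 1; i < 8; ++i)
        best = maxnm(best, t[i]);
    return best;
}
```

With movmskb you would instead have to mask out the NaN lanes, extract a bitmask, and branch or index to find the winner.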
>
> > But isn't that again an example of optimising for throughput rather than latency?
> > Batching hash table lookups is hardly an option for many applications.
> hm, many applications? Why not? I'd venture that the problem is that almost all SW interfaces
> (including STL) are limited to single items. To the extent we can break out of that, we are rewarded
> with 3-5x end to end speedups. For example in hash joins, which burn a lot of compute.
>
> > What I primarily have in mind is leveraging SIMD to make certain basic discrete
> > operations faster, rather than converting the entire logic to be data-parallel.
> Indeed we get best results from making everything data-parallel. Is
> there any particular hindrance besides having to rewrite/redesign?
Thinking about it, you are probably right. I am not doing enough isolated operations for them to be a meaningful optimisation target, and the truly performance-critical parts would definitely benefit from a redesign. Thanks, you certainly gave me something to think about :)