By: Jan Wassenberg (jan.wassenberg.delete@this.gmail.com), May 17, 2022 12:26 pm
Room: Moderated Discussions
> Is there some way to automatically convert SVE2 code to slower
> NEON code so that you don’t have to write two versions?
> If code was autovectorized, it would be easy to make two different executables, one for SVE2 and one
> for NEON. If SVE2 code was written by hand, there would be no way to run the SVE2 code on an older device without SVE2.
> No one wants to write two different versions by hand, one for SVE2 and one for NEON.
Agreed. github.com/google/highway helps with this: it provides "portable intrinsics" (wrapper functions that call NEON or SVE[2] intrinsics), and it supports compiling your code once per target platform and then dispatching to the best available version, either at compile time or at runtime.
On x86 this works within a single compilation, but arm_neon.h and arm_sve.h currently require the corresponding compiler flags to be set before they are included. So until the compiler lifts this limitation, the best we can do for Arm is to compile the same source file multiple times with different compiler flags.
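To make the runtime-dispatch idea concrete, here is a rough sketch in plain C++. It is not Highway's actual API: the names (Target, SumScalar, SumNeon, SumSve2, ChooseSum, SupportedTargets) are made up for illustration, and the three "implementations" are ordinary loops standing in for code that would really come from the same source file compiled once per target. The CPU check assumes Linux on AArch64 and falls back to scalar-only elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__linux__) && defined(__aarch64__)
#include <asm/hwcap.h>  // HWCAP_ASIMD, HWCAP2_SVE2
#include <sys/auxv.h>   // getauxval
#endif

// Hypothetical target bits for this sketch (not Highway's own names).
enum Target : uint32_t { kScalar = 1u, kNeon = 2u, kSve2 = 4u };

using SumFunc = uint64_t (*)(const uint32_t*, size_t);

// Stand-ins: in a real build, each of these would be the same source
// compiled with different flags, so the bodies would use NEON or SVE2.
uint64_t SumScalar(const uint32_t* v, size_t n) {
  uint64_t total = 0;
  for (size_t i = 0; i < n; ++i) total += v[i];
  return total;
}
uint64_t SumNeon(const uint32_t* v, size_t n) { return SumScalar(v, n); }
uint64_t SumSve2(const uint32_t* v, size_t n) { return SumScalar(v, n); }

// Returns a bitfield of targets the current CPU can run.
uint32_t SupportedTargets() {
  uint32_t targets = kScalar;
#if defined(__linux__) && defined(__aarch64__)
  if (getauxval(AT_HWCAP) & HWCAP_ASIMD) targets |= kNeon;
#ifdef HWCAP2_SVE2
  if (getauxval(AT_HWCAP2) & HWCAP2_SVE2) targets |= kSve2;
#endif
#endif
  return targets;
}

// Picks the best implementation among the supported targets.
SumFunc ChooseSum(uint32_t supported) {
  assert(supported & kScalar);  // the scalar fallback is always expected
  if (supported & kSve2) return &SumSve2;
  if (supported & kNeon) return &SumNeon;
  return &SumScalar;
}

// Public entry point: detects the CPU once, then reuses the choice.
uint64_t Sum(const uint32_t* v, size_t n) {
  static const SumFunc func = ChooseSum(SupportedTargets());
  return func(v, n);
}
```

Callers only ever see Sum(); which variant runs is decided on first use, so the same binary can run on devices with and without SVE2.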
For an example of this in action, see vqsort (vectorized quicksort): https://arxiv.org/abs/2205.05982
Happy to discuss via GitHub issues or email.