By: Adrian (a.delete@this.acm.org), May 22, 2022 5:04 am
Room: Moderated Discussions
Charlie Burnes (charlie.burnes.delete@this.no-spam.com) on May 21, 2022 10:11 pm wrote:
>
> The main thing I’m struggling with is I can’t write my code in a way that is SIMD width agnostic. I
> would like to write it using 512-bit vectors and have some way to automatically make it run on machines
> with 256-bit and 128-bit vectors. That does not seem to be possible with Highway, as far as I can determine.
One way to write SIMD-width-agnostic code is the GPU style, a.k.a. single-program-multiple-data (SPMD), as implemented for example by the free Intel ISPC compiler (which can target many Intel/AMD ISA variants as well as ARM). In this case the program is written for a single lane, and the compiler takes care of executing in parallel as many "threads" as fit the width of the SIMD registers.
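To illustrate just the programming model, here is a rough C++ sketch, not ISPC code and not how ISPC works internally. The names `spmd_for` and `saxpy` are made up for this example. The kernel is written for one lane (one index), and a hypothetical driver runs it in gangs of `W` lanes; whether the inner loop actually becomes SIMD instructions is left to the compiler, which is exactly the part ISPC handles for you.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of ISPC or any library): calls kernel(i) for
// every i in [0, n), grouped in gangs of W consecutive indices. The inner loop
// has a compile-time trip count, which a compiler may map onto SIMD lanes;
// nothing here guarantees that it does.
template <std::size_t W, typename Kernel>
void spmd_for(std::size_t n, Kernel kernel) {
    static_assert(W > 0, "gang width must be positive");
    std::size_t i = 0;
    for (; i + W <= n; i += W) {
        for (std::size_t lane = 0; lane < W; ++lane) {
            kernel(i + lane);
        }
    }
    for (; i < n; ++i) {
        kernel(i);  // leftover elements that do not fill a whole gang
    }
}

// The "program" is written for a single lane: y[i] = a * x[i] + y[i].
// The gang width W appears only as a parameter, never in the kernel body.
template <std::size_t W>
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    spmd_for<W>(x.size(), [&](std::size_t i) { y[i] = a * x[i] + y[i]; });
}
```

Calling `saxpy<4>(...)` or `saxpy<16>(...)` on the same data should give the same result; only the grouping of the work changes.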
I do not have experience with ISPC, but I have seen reports from people who were surprised by the unexpectedly good performance of the code generated by ISPC, compared with more widespread methods of writing SIMD code.
It may happen that some algorithms cannot be SIMD width agnostic if maximum performance is also wanted, because the associated data structures must have different layouts depending on the SIMD width.
That does not mean that different programs must be written. A single generic program, having the width as a parameter, should suffice.
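As a minimal sketch of what "having the width as a parameter" could look like in C++, assuming a made-up example of 2D points stored in a blocked (array-of-structures-of-arrays) layout: `PointBlock`, `pack` and `sum_of_products` are hypothetical names invented for illustration. The block size follows the register width, so the memory layout differs per width while the source stays the same.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout: one block holds as many float lanes as fit into a
// register of VectorBits bits, with x and y kept in separate lane arrays.
template <std::size_t VectorBits>
struct PointBlock {
    static_assert(VectorBits % 32 == 0, "width must be a multiple of 32 bits");
    static constexpr std::size_t lanes = VectorBits / 32;
    std::array<float, lanes> x;
    std::array<float, lanes> y;
};

// Repacks plain arrays into the width-dependent layout. The last block is
// zero-padded because the vector value-initializes its elements.
template <std::size_t VectorBits>
std::vector<PointBlock<VectorBits>> pack(const std::vector<float>& xs,
                                         const std::vector<float>& ys) {
    assert(xs.size() == ys.size());
    constexpr std::size_t L = PointBlock<VectorBits>::lanes;
    std::vector<PointBlock<VectorBits>> blocks((xs.size() + L - 1) / L);
    for (std::size_t i = 0; i < xs.size(); ++i) {
        blocks[i / L].x[i % L] = xs[i];
        blocks[i / L].y[i % L] = ys[i];
    }
    return blocks;
}

// The algorithm is written once; only the template argument changes.
template <std::size_t VectorBits>
float sum_of_products(const std::vector<PointBlock<VectorBits>>& blocks) {
    float total = 0.0f;
    for (const auto& b : blocks) {
        for (std::size_t lane = 0; lane < PointBlock<VectorBits>::lanes; ++lane) {
            total += b.x[lane] * b.y[lane];
        }
    }
    return total;
}
```

With five points, `pack<128>` produces two blocks of 4 lanes and `pack<512>` produces one block of 16 lanes. The padding lanes are zero, so they do not affect the sum.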
While this is more complicated than being able to ignore the width, I do not find it the greatest difficulty. When targeting multiple ISA variants with a single program (one that does not rely on compiler vectorization), I have to do a lot of other macro processing anyway, e.g. substituting generic names for SIMD operations with the actual names of the assembly instructions or C intrinsics in the target ISA (which can also be affected by the register width parameter). Parameterized data structures are just one part of it.
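A small sketch of that kind of macro processing, assuming x86 targets and a scalar fallback for everything else. The generic names (`VEC_T`, `VEC_LOAD`, `VEC_ADD`, `VEC_STORE`, `VEC_BITS`, `VEC_LANES`) and the function `add_arrays` are invented for this example; the intrinsics they expand to are the standard ones from `<immintrin.h>`. A real setup would cover far more operations and ISAs.

```cpp
#include <cassert>
#include <cstddef>

// Map generic names onto the intrinsics of whichever ISA the compiler targets.
#if defined(__AVX512F__)
  #include <immintrin.h>
  #define VEC_BITS 512
  #define VEC_T __m512
  #define VEC_LOAD(p) _mm512_loadu_ps(p)
  #define VEC_ADD(a, b) _mm512_add_ps(a, b)
  #define VEC_STORE(p, v) _mm512_storeu_ps(p, v)
#elif defined(__AVX__)
  #include <immintrin.h>
  #define VEC_BITS 256
  #define VEC_T __m256
  #define VEC_LOAD(p) _mm256_loadu_ps(p)
  #define VEC_ADD(a, b) _mm256_add_ps(a, b)
  #define VEC_STORE(p, v) _mm256_storeu_ps(p, v)
#elif defined(__SSE2__)
  #include <immintrin.h>
  #define VEC_BITS 128
  #define VEC_T __m128
  #define VEC_LOAD(p) _mm_loadu_ps(p)
  #define VEC_ADD(a, b) _mm_add_ps(a, b)
  #define VEC_STORE(p, v) _mm_storeu_ps(p, v)
#else
  // Scalar fallback so the same source still builds on other targets.
  #define VEC_BITS 32
  #define VEC_T float
  #define VEC_LOAD(p) (*(p))
  #define VEC_ADD(a, b) ((a) + (b))
  #define VEC_STORE(p, v) (*(p) = (v))
#endif

// Number of float lanes per register; derived from the width parameter.
#define VEC_LANES (VEC_BITS / 32)

// Written once against the generic names: out[i] = a[i] + b[i].
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    std::size_t i = 0;
    for (; i + VEC_LANES <= n; i += VEC_LANES) {
        VEC_T va = VEC_LOAD(a + i);
        VEC_T vb = VEC_LOAD(b + i);
        VEC_STORE(out + i, VEC_ADD(va, vb));
    }
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];  // tail elements that do not fill a register
    }
}
```

The loop stride and the tail handling both depend on `VEC_LANES`, which is the sort of place where the width has to be visible in the code rather than assumed.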
Writing a program for 512-bit vectors and having it converted automatically to other register widths does not seem feasible in the general case, because the compiler cannot always know where in the code and in the data structures assumptions about the register width have been made. Therefore I believe that the only general way is to use a vector width parameter, which makes explicit all the places in the code that depend on knowing the width.