By: Jukka Larja (roskakori2006.delete@this.gmail.com), May 24, 2022 7:38 am
Room: Moderated Discussions
Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on May 24, 2022 4:35 am wrote:
> Jukka Larja (roskakori2006.delete@this.gmail.com) on May 24, 2022 2:48 am wrote:
> > How is it a solved problem? I have no idea what to do to solve that problem
> > within our code base, without significant development costs.
> At least for new code, I am happy with the following approach:
> 1) functions that use SIMD and are too short to justify an indirect call are implemented as forceinline
> functions in headers, e.g. https://github.com/google/highway/blob/master/hwy/examples/skeleton-inl.h
>
> 2) The code that calls these functions is compiled multiple times, once per target. Some boilerplate
> (https://github.com/google/highway/blob/master/hwy/examples/skeleton.cc) takes care of the multiple compilation
> (no assistance needed from the build system) and generates a table of function pointers.
>
> 3) To call the code (again, infrequently enough that indirect calls are fine), a HWY_DYNAMIC_DISPATCH(Func)(args)
> macro expands to checking CPU capabilities and selecting the appropriate function from the table.
>
> 4) All of this can be hidden behind a C++ function whose declaration (in a header: https://github.com/google/highway/blob/master/hwy/examples/skeleton.h)
> looks entirely normal.
>
> The result is a bit more compile time, and a binary that replicates the per-CPU code
> (not the entire binary). I'm curious how well that would integrate into your code?
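Putting steps 1-4 together, the non-header side of that pattern looks roughly like the sketch below. It is a compressed version of the linked skeleton example, not a drop-in file: the names (project, ScaleArray, CallScaleArray) are placeholders, and remainder handling for sizes that are not a multiple of the vector length is omitted.

// scale.cc: compiled once by the build system; foreach_target.h re-includes
// it internally, once per SIMD target (the step-2 boilerplate).
#undef HWY_TARGET_INCLUDE
#define HWY_TARGET_INCLUDE "scale.cc"  // path of this file
#include "hwy/foreach_target.h"        // must come before highway.h
#include "hwy/highway.h"

HWY_BEFORE_NAMESPACE();
namespace project {
namespace HWY_NAMESPACE {  // expands to a distinct namespace per target
namespace hn = hwy::HWY_NAMESPACE;

// Step 1: a short SIMD helper, force-inlined into the per-target caller.
template <class D, class V>
HWY_INLINE V MulBy2(D d, V v) {
  return hn::Mul(v, hn::Set(d, 2.0f));
}

// Step 2: compiled once per target (SSE2, AVX2, AVX-512, ...).
void ScaleArray(const float* HWY_RESTRICT in, size_t n,
                float* HWY_RESTRICT out) {
  const hn::ScalableTag<float> d;
  for (size_t i = 0; i < n; i += hn::Lanes(d)) {  // assumes n % Lanes(d) == 0
    hn::Store(MulBy2(d, hn::Load(d, in + i)), d, out + i);
  }
}

}  // namespace HWY_NAMESPACE
}  // namespace project
HWY_AFTER_NAMESPACE();

#if HWY_ONCE  // compiled only once, after all targets have been generated
namespace project {
HWY_EXPORT(ScaleArray);  // builds the table of per-target function pointers

// Steps 3-4: an ordinary-looking function; its declaration in a header is
// plain C++. The macro checks CPU capabilities and indirect-calls the best
// available implementation from the exported table.
void CallScaleArray(const float* in, size_t n, float* out) {
  return HWY_DYNAMIC_DISPATCH(ScaleArray)(in, n, out);
}
}  // namespace project
#endif  // HWY_ONCE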
Not that well, really. What you describe pretty obviously has a significant development cost by our standards, especially weighed against the benefit of moving from SSE2 to AVX or AVX-512 (we haven't even done much with SSE2 yet).
A realistic option for us would be something that's basically a compile flag that produces a fat binary, or something along those lines, covering whatever autovectorization the compiler manages and whatever is in third-party code.
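(The closest existing mechanism to that, for code the compiler autovectorizes itself, is probably per-function multi-versioning such as the target_clones attribute in GCC and newer Clang: the compiler emits one clone per listed target plus a resolver that picks the best clone at load time. A minimal sketch, with a made-up function and an arbitrary target list; availability depends on compiler and platform support (ifunc on Linux/glibc), and it does nothing for prebuilt third-party binaries:

#include <cstddef>

// One baseline clone plus AVX2 and AVX-512 clones are generated, each
// autovectorized for its target; a resolver picks the best one at load time.
__attribute__((target_clones("default", "avx2", "avx512f")))
void Scale(const float* in, std::size_t n, float* out) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = in[i] * 2.0f;
  }
}
)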
-JLarja