By: Jan Wassenberg (jan.wassenberg.delete@this.gmail.com), May 24, 2022 4:35 am
Room: Moderated Discussions
Jukka Larja (roskakori2006.delete@this.gmail.com) on May 24, 2022 2:48 am wrote:
> How is it a solved problem? I have no idea what to do to solve that problem
> within our code base, without significant development costs.
At least for new code, I am happy with the following approach:
1) Functions that use SIMD and are too short to justify an indirect call are implemented as force-inlined functions in headers, e.g. https://github.com/google/highway/blob/master/hwy/examples/skeleton-inl.h
2) The code that calls these functions is compiled multiple times, once per target. Some boilerplate (https://github.com/google/highway/blob/master/hwy/examples/skeleton.cc) takes care of the multiple compilation (no assistance needed from the build system) and generates a table of function pointers.
3) To call that code (infrequently enough that indirect calls are fine), a HWY_DYNAMIC_DISPATCH(Func)(args) macro expands to code that checks CPU capabilities and selects the appropriate function from the table.
4) All of this can be hidden behind a C++ function whose declaration (in a header: https://github.com/google/highway/blob/master/hwy/examples/skeleton.h) looks entirely normal.
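To make the four steps concrete, here is a stripped-down sketch of the general pattern. It is not Highway's actual mechanism: a macro stands in for the re-compilation boilerplate, plain `inline` stands in for forceinline, the kernel bodies are scalar so the example runs anywhere, and every name in it (Target, DEFINE_SUM_FOR_TARGET, N_BASELINE, N_AVX2, DetectBestTarget, SumPlusOne) is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical target list. A real build would compile each copy with
// different instruction-set flags or attributes.
enum Target { kBaseline = 0, kAVX2 = 1, kNumTargets = 2 };

// Steps 1 and 2: stamp out one copy of the code per target namespace.
// AddOne is the short helper that gets inlined into its caller; SumPlusOne
// is the larger function that is reached through an indirect call.
#define DEFINE_SUM_FOR_TARGET(NS)                        \
  namespace NS {                                         \
  inline float AddOne(float x) { return x + 1.0f; }      \
  float SumPlusOne(const float* in, size_t n) {          \
    float sum = 0.0f;                                    \
    for (size_t i = 0; i < n; ++i) sum += AddOne(in[i]); \
    return sum;                                          \
  }                                                      \
  }

DEFINE_SUM_FOR_TARGET(N_BASELINE)
DEFINE_SUM_FOR_TARGET(N_AVX2)

// Table of function pointers, indexed by Target.
using SumFunc = float (*)(const float*, size_t);
static const SumFunc kSumTable[kNumTargets] = {&N_BASELINE::SumPlusOne,
                                               &N_AVX2::SumPlusOne};

// Step 3: check CPU capabilities. Falls back to the baseline when the
// compiler builtin is unavailable.
Target DetectBestTarget() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  if (__builtin_cpu_supports("avx2")) return kAVX2;
#endif
  return kBaseline;
}

// Step 4: an ordinary-looking function that hides the dispatch. The
// detection runs once; later calls only pay for the indirect call.
float SumPlusOne(const float* in, size_t n) {
  static const Target target = DetectBestTarget();
  assert(target < kNumTargets);
  return kSumTable[target](in, n);
}
```

Callers just write SumPlusOne(data, count) and never see the table. Only the code inside the per-target namespaces is duplicated, which is why the binary grows by the size of the SIMD code rather than doubling.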
The cost is somewhat longer compile times and a binary that contains one copy of the SIMD code per target (only that code is replicated, not the entire binary). I'm curious how well that would integrate into your code base.