By: -.- (blarg.delete@this.mailinator.com), May 20, 2022 3:55 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on May 19, 2022 1:14 am wrote:
> E.g. you invested significant effort in manual optimization of AVX path. You organized your arrays
> in 256-bit oriented "hybrid" (==AoSoA) data layout and use of _mm256_xxx() in two inner levels
> of loops. And you achieved good results, say, 65-70% of peak FLOPs of your AVX2 CPU. Then now you
> probably want to achieve similar or slightly lower sustained-to-peak FLOPs ratio with AVX-512.
Why not something like:
#ifdef __AVX512F__
# define _mm(f) _mm512_##f
# define __mfloat __m512
# include "your-code-file.c"
# undef _mm
# undef __mfloat
#else
# define _mm(f) _mm256_##f
# define __mfloat __m256
# include "your-code-file.c"
# undef _mm
# undef __mfloat
#endif
Then, with a bit of find & replace (_mm256_xxx() becomes _mm(xxx)(), __m256 becomes __mfloat), the code switches vector width based on a compiler switch. Unlike other SIMD abstraction layers, this retains 100% of the ISA functionality, since you're still calling the real intrinsics.
Obviously this won't work if your workload is highly width-dependent, but if it isn't, it should do a reasonably good job (and you can sprinkle #ifdefs where it's advantageous).
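To make the find & replace concrete, here is a rough sketch of what a kernel might look like once it is written against the _mm()/__mfloat macros. The kernel name scale_add and the VEC_FLOATS helper are made up for illustration, and the macros are inlined into one file rather than wrapped around an #include. The SSE fallback branch isn't part of the suggestion above; I've only assumed it so the snippet builds without any -mavx flags.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Width selection, same idea as the macros above but inlined into one file.
   The SSE branch is an added assumption so this builds without -mavx. */
#if defined(__AVX512F__)
# define _mm(f) _mm512_##f
# define __mfloat __m512
#elif defined(__AVX__)
# define _mm(f) _mm256_##f
# define __mfloat __m256
#else
# define _mm(f) _mm_##f
# define __mfloat __m128
#endif

/* Floats per vector: 16, 8 or 4 depending on the branch taken (hypothetical helper). */
#define VEC_FLOATS (sizeof(__mfloat) / sizeof(float))

/* Hypothetical kernel: y[i] = a * x[i] + y[i].
   Assumes padded arrays, i.e. n is a multiple of VEC_FLOATS. */
void scale_add(float *y, const float *x, float a, size_t n)
{
    assert(n % VEC_FLOATS == 0);
    const __mfloat va = _mm(set1_ps)(a);   /* broadcast a to all lanes */
    for (size_t i = 0; i < n; i += VEC_FLOATS) {
        __mfloat vx = _mm(loadu_ps)(x + i);
        __mfloat vy = _mm(loadu_ps)(y + i);
        vy = _mm(add_ps)(_mm(mul_ps)(va, vx), vy);
        _mm(storeu_ps)(y + i, vy);
    }
}
```

The only width-dependent bits left in the kernel are the loop stride and the padding requirement, which is where an #ifdef or a constant like VEC_FLOATS would come in. Anything that only exists at one width (some shuffles, mask registers) would still need its own #ifdef.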