By: Michael S (already5chosen.delete@this.yahoo.com), May 20, 2022 4:51 am
Room: Moderated Discussions
-.- (blarg.delete@this.mailinator.com) on May 20, 2022 3:55 am wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on May 19, 2022 1:14 am wrote:
> > E.g. you invested significant effort in manual optimization of AVX path. You organized your arrays
> > in 256-bit oriented "hybrid" (==AoSoA) data layout and use of _mm256_xxx() in two inner levels
> > of loops. And you achieved good results, say, 65-70% of peak FLOPs of your AVX2 CPU. Then now you
> > probably want to achieve similar or slightly lower sustained-to-peak FLOPs ratio with AVX-512.
>
> Why not something like:
>
>
> #ifdef __AVX512F__
> # define _mm(f) _mm512_##f
> # define __mfloat __m512
>
> # include "your-code-file.c"
>
> # undef _mm
> # undef __mfloat
>
> #else
>
> # define _mm(f) _mm256_##f
> # define __mfloat __m256
>
> # include "your-code-file.c"
>
> # undef _mm
> # undef __mfloat
>
> #endif
> and with a bit of find & replace, the code will magically transform based on a compiler
> switch. Retains 100% ISA functionality unlike other SIMD abstraction layers.
> Obviously won't work if your workload is highly width dependent, but if it isn't, should
> do a reasonably good job (and you can sprinkle #ifdefs where it's advantageous).
Yes, it can work in my case.
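For example, an inner kernel written against those _mm()/__mfloat macros might look like the sketch below. To be clear, this is my own illustration, not code from the post above: the saxpy-style loop and names are made up, and it assumes n is a multiple of the vector width and that FMA is available (true on any AVX-512 part, and on the AVX2+FMA targets discussed here).

#include <stddef.h>
#include <immintrin.h>

/* your-code-file.c: compiled once per ISA by the wrapper above.
   sizeof(__mfloat)/sizeof(float) is 8 for __m256, 16 for __m512. */
void saxpy_kernel(float *restrict y, const float *restrict x,
                  float a, size_t n)
{
    const size_t w = sizeof(__mfloat) / sizeof(float);
    __mfloat va = _mm(set1_ps)(a);            /* broadcast a */
    for (size_t i = 0; i < n; i += w) {
        __mfloat vx = _mm(loadu_ps)(x + i);
        __mfloat vy = _mm(loadu_ps)(y + i);
        vy = _mm(fmadd_ps)(va, vx, vy);       /* y[i] += a * x[i] */
        _mm(storeu_ps)(y + i, vy);
    }
}

Build the AVX2 version with -mavx2 -mfma and the AVX-512 version with -mavx512f; GCC and Clang define __AVX512F__ under the latter, so the wrapper picks the wider type automatically.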
But it's certainly not the same as what was suggested in the post that started this branch of discussion:
> From where I sit, the only investment required for AVX-512 is a few extra seconds of compile time plus maybe ~100KiB larger binaries.
Now, after reading the rest of Jan's posts, I am starting to believe that in his case it is indeed that simple, but only because he and his co-workers turned a potentially compute-bound problem into an LS-bound (load/store-bound) one, losing in the process a factor of 2 of potential performance (2 at best, if the inner loop's data set still fits in L1D; otherwise the factor is bigger than 2) for the sake of portability and of simplifying their own work.
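For context, the 256-bit oriented "hybrid" (AoSoA) layout I mentioned in the quoted text is the usual arrangement along these lines; the field names are purely illustrative:

/* AoSoA: each field is stored in register-width groups, so a single
   256-bit load fetches eight consecutive x (or y, or z) values. */
enum { VLEN = 8 };                    /* floats per __m256 */
struct body_block {
    float x[VLEN];
    float y[VLEN];
    float z[VLEN];
};
/* In a struct body_block array, block i holds elements
   i*VLEN through i*VLEN + VLEN - 1. */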