By: Jan Wassenberg (jan.wassenberg.delete@this.gmail.com), May 19, 2022 7:02 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on May 19, 2022 1:14 am wrote:
> If you are currently utilizing, say, 30% of AVX2+FMA computational ability and want to improve it to utilization
> of, say, 25% of AVX-512 ability then your method is likely (for same value of 'likely') to work.
>
> But what if your goals are more ambitious?
> E.g. you invested significant effort in manual optimization of AVX path. You organized your arrays
> in 256-bit oriented "hybrid" (==AoSoA) data layout and use of _mm256_xxx() in two inner levels
> of loops. And you achieved good results, say, 65-70% of peak FLOPs of your AVX2 CPU. Then now you
> probably want to achieve similar or slightly lower sustained-to-peak FLOPs ratio with AVX-512.
> Flopping compiler switch is of little help in such case. More likely, of no help at all.
I understand it's not fun to port existing code. Let's imagine a scenario where the data layout/interleaving is chosen to be larger than any expected vector (say 4096 bits, enough even for current RISC-V with LMUL=8), and instead of _mm256 you're using Highway wrapper functions plus a loop (possibly fully unrolled by the compiler) over your 4096-bit blocks.
Then AVX-512 is indeed just a recompile away. In shipping production code (JPEG XL, vqsort) I can report 1.4-1.6x speedups over AVX2, and that figure already includes the effect of throttling.
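To make the layout idea concrete, here is a rough sketch in plain C++. It is not Highway's actual API: Vec<N>, Load, MulAdd, Store and AxpyBlocks are hypothetical stand-ins I made up for illustration, with the lane count N playing the role of the target's vector width (8 floats for AVX2, 16 for AVX-512). The only thing the data layout commits to is the 4096-bit block, so the same kernel source works for any N that divides it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Block size fixed by the data layout: 4096 bits = 128 floats.
constexpr std::size_t kBlockFloats = 4096 / (8 * sizeof(float));

// Hypothetical stand-in for a SIMD wrapper type with N float lanes.
// A real wrapper would map this to a hardware vector register.
template <std::size_t N>
struct Vec {
  std::array<float, N> lane;
};

template <std::size_t N>
Vec<N> Load(const float* p) {
  Vec<N> v;
  for (std::size_t i = 0; i < N; ++i) v.lane[i] = p[i];
  return v;
}

template <std::size_t N>
Vec<N> Set(float value) {
  Vec<N> v;
  v.lane.fill(value);
  return v;
}

// Returns a * b + c per lane.
template <std::size_t N>
Vec<N> MulAdd(const Vec<N>& a, const Vec<N>& b, const Vec<N>& c) {
  Vec<N> r;
  for (std::size_t i = 0; i < N; ++i) r.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
  return r;
}

template <std::size_t N>
void Store(const Vec<N>& v, float* p) {
  for (std::size_t i = 0; i < N; ++i) p[i] = v.lane[i];
}

// y = a * x + y over num_blocks blocks of kBlockFloats floats each.
// The inner loop has a compile-time trip count (kBlockFloats / N),
// so the compiler is free to unroll it fully.
template <std::size_t N>
void AxpyBlocks(float a, const float* x, float* y, std::size_t num_blocks) {
  static_assert(kBlockFloats % N == 0, "block must hold whole vectors");
  assert(x != nullptr && y != nullptr);
  const Vec<N> va = Set<N>(a);
  for (std::size_t b = 0; b < num_blocks; ++b) {
    const float* xb = x + b * kBlockFloats;
    float* yb = y + b * kBlockFloats;
    for (std::size_t i = 0; i < kBlockFloats; i += N) {
      Store(MulAdd(va, Load<N>(xb + i), Load<N>(yb + i)), yb + i);
    }
  }
}
```

Calling AxpyBlocks<8> and AxpyBlocks<16> on the same buffers gives the same result; only the number of inner-loop iterations per block changes (16 vs. 8). That's the sense in which the wider target is "just a recompile": the layout never hard-coded 256 bits.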