By: Michael S (already5chosen.delete@this.yahoo.com), May 19, 2022 8:13 am
Room: Moderated Discussions
Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on May 19, 2022 7:02 am wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on May 19, 2022 1:14 am wrote:
> > If you are currently utilizing, say, 30% of AVX2+FMA computational
> > ability and want to improve it to utilization
> > of, say, 25% of AVX-512 ability then your method is likely (for same value of 'likely') to work.
> >
> > But what if your goals are more ambitious?
> > E.g. you invested significant effort in manual optimization of AVX path. You organized your arrays
> > in 256-bit oriented "hybrid" (==AoSoA) data layout and use of _mm256_xxx() in two inner levels
> > of loops. And you achieved good results, say, 65-70% of peak FLOPs of your AVX2 CPU. Then now you
> > probably want to achieve similar or slightly lower sustained-to-peak FLOPs ratio with AVX-512.
> > Flopping compiler switch is of little help in such case. More likely, of no help at all.
>
> I understand it's not fun to port existing code. Let's imagine a scenario where the data layout/interleaving
> is chosen to be larger than any expected vector (say 4096 bits, enough even for current RISC-V
> with LMUL=8), and instead of _mm256 you're using Highway wrapper functions plus a loop (possibly
> fully unrolled by the compiler) over your 4096-bit blocks.
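(For concreteness, here is roughly what such a width-agnostic block loop looks like, sketched in plain C rather than the actual Highway API; the function names and the 64-double block size are just for illustration.)

```c
#include <stddef.h>

#define BLOCK 64   /* 4096 bits / 64-bit doubles */

/* c[] += a[] * b[] over one hybrid (AoSoA) block; the fixed trip count
 * lets the compiler unroll/vectorize for AVX2, AVX-512, or RVV alike. */
static void fma_block(double *restrict c, const double *restrict a,
                      const double *restrict b) {
    for (size_t i = 0; i < BLOCK; ++i)
        c[i] += a[i] * b[i];
}

/* Outer loop over whole 4096-bit blocks of the interleaved layout. */
static void fma_blocks(double *restrict c, const double *restrict a,
                       const double *restrict b, size_t nblocks) {
    for (size_t blk = 0; blk < nblocks; ++blk)
        fma_block(c + blk * BLOCK, a + blk * BLOCK, b + blk * BLOCK);
}
```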
Maybe that is a good layout for your jobs; for many of my jobs it's too coarse.
For example, one of my important workloads is decomposition of Hermitian matrices with N in the range 50 to 100. Padding everything to multiples of 64 doubles would reduce performance by a bigger factor than the 2x difference between AVX-512 and AVX2 (note that Hermitian/symmetric matrices are triangular rather than square, which roughly doubles the impact of padding). And that's before we face the fact that on the majority of Intel's hardware the real difference in throughput is significantly less than 2x.
Thinking about it, even padding to multiples of 8 doubles would eat a significant portion of the potential gain, but in that case it hopefully would not eat all of it.
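To put a rough number on that padding cost, here is a small back-of-envelope sketch (my own illustration, not measured code): it compares the element count of a padded lower triangle against the unpadded count.

```c
/* Ratio of padded to real work for the lower triangle of an N x N
 * Hermitian/symmetric matrix when each row is padded up to a multiple
 * of `pad` doubles (e.g. 8 for one AVX-512 vector of doubles, 64 for
 * the 4096-bit blocks discussed above). */
static double pad_ratio(int n, int pad) {
    long padded = 0, real_elems = 0;
    for (int i = 0; i < n; ++i) {
        int len = i + 1;                          /* row i holds i+1 elements */
        padded += ((len + pad - 1) / pad) * pad;  /* round up to multiple */
        real_elems += len;
    }
    return (double)padded / (double)real_elems;
}
```

For N = 50 this gives about 2.5x extra work when padding rows to 64 doubles, but only about 1.14x when padding to 8, consistent with the point above.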
>
> Then AVX-512 is indeed just a recompile away, and in shipping production code
> (JPEG XL, vqsort) I can report 1.4-1.6x speedups, including throttling.
The question is: how close are you to peak FLOPs?
As I said in my original post, if you are at 30% then easy gains are certainly possible.
Also, if you are at 30%, then you probably do not suffer from thermal throttling of the CPU frequency nearly as much as those who are at 60-80%.
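For reference, the peak figure in that utilization ratio is just clock times FMA throughput (a sketch with hypothetical machine parameters; the real peak depends on the actual sustained AVX/AVX-512 frequency, which throttling lowers):

```c
/* Theoretical peak double-precision GFLOP/s of one core:
 * frequency (GHz) x FMA ports x vector lanes x 2 flops per FMA. */
static double peak_gflops(double ghz, int fma_ports, int lanes) {
    return ghz * fma_ports * lanes * 2.0;
}
```

E.g. a hypothetical core at 3 GHz with two 256-bit FMA ports: peak_gflops(3.0, 2, 4) = 48 GFLOP/s DP, so the 30% case means sustaining about 14.4 GFLOP/s.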