By: Michael S (already5chosen.delete@this.yahoo.com), May 24, 2022 3:48 am
Room: Moderated Discussions
Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on May 23, 2022 10:38 pm wrote:
> Björn Ragnar Björnsson (bjorn.ragnar.delete@this.gmail.com) on May 23, 2022 5:41 pm wrote:
> > Hopefully they've learned
> > their lesson, to wit they disabled AVX-512 in BIOS updates, didn't they?
> It sounds like you view that as a good thing. My understanding is that people were
> more upset by the removal (especially after initial benchmarks included AVX-512).
>
> Perhaps you prefer to keep the status quo. Here's a problem, though: scheduling an instruction
> on big OoO cores costs considerably more energy than most operations, even FP mul. Thus vectorization
> is pretty much required for energy efficiency (it amortizes that cost across e.g. 16 lanes).
> We're talking 5-10x energy reduction here. Surely that is worth pursuing.
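To make the quoted 16-lane claim concrete: one AVX-512 add covers what sixteen scalar adds cover, so whatever the front end and scheduler spend per instruction is split sixteen ways. A minimal sketch (compile with -mavx512f; n assumed to be a multiple of 16):

#include <immintrin.h>
#include <cstddef>

// Scalar: one scheduled add per element.
void add_scalar(const float* a, const float* b, float* out, size_t n) {
  for (size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// AVX-512: one scheduled add per 16 elements.
void add_avx512(const float* a, const float* b, float* out, size_t n) {
  for (size_t i = 0; i < n; i += 16) {
    __m512 va = _mm512_loadu_ps(a + i);
    __m512 vb = _mm512_loadu_ps(b + i);
    _mm512_storeu_ps(out + i, _mm512_add_ps(va, vb));
  }
}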
Did you try to measure the efficiency gain yourself?
For example, AVX-512 vs. AVX2 on your own JPEG XL decoder, on a widely available Tiger Lake or Rocket Lake CPU?
I never did it myself, but I would be very surprised if, for the whole decode process, from reading the compressed image off the SSD to displaying it, you could measure any difference in energy consumption at all. Even for batch-mode, memory-to-memory decode (the whole process, not just the SIMD-friendly parts) I would expect a reduction of 10-15% at best.
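If anyone does want to measure it, on Linux the package energy counter is exposed through the RAPL powercap interface. A minimal sketch, assuming the usual intel-rapl:0 package domain (the path can differ per machine, and decode_batch() is a placeholder for the actual decode under test):

#include <cstdint>
#include <fstream>
#include <iostream>

// Cumulative package energy in microjoules. The counter wraps at
// max_energy_range_uj; a real harness should handle that.
uint64_t ReadEnergyUj() {
  std::ifstream f("/sys/class/powercap/intel-rapl:0/energy_uj");
  uint64_t uj = 0;
  f >> uj;
  return uj;
}

void decode_batch();  // placeholder: decode N images, memory to memory

int main() {
  const uint64_t before = ReadEnergyUj();
  decode_batch();
  const uint64_t after = ReadEnergyUj();
  std::cout << "package energy: " << (after - before) / 1e6 << " J\n";
}

Run the same harness twice, once built for AVX2 and once for AVX-512, and compare.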
>
> Unfortunately not that many developers understand yet that SIMD/vectors are widely useful, not just
> in ML/cryptography/HPC/image processing niches.
I am one of those who doesn't understand, at least as long as HPC is construed broadly.
And I consider myself a SIMD fan rather than a denier.
> Runtime dispatch for heterogeneous devices (in the
> sense of: a binary doesn't know what type of single-ISA machine it's going to run on: Haswell, Skylake
> etc) is also already a solved problem. AVX-512 is the first x86 SIMD instruction set that's reasonable
> to program (quite complete, useful new instructions for general-purpose applications). Dropping AVX-512
> even in one generation is not a helpful signal for increasing its adoption.
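For readers who haven't seen it, the runtime dispatch mentioned above is, in its simplest form, just this (GCC/Clang builtins; the function names are illustrative, and libraries like Highway automate the pattern):

#include <cstddef>

void AddAvx512(const float* a, const float* b, float* out, size_t n);
void AddAvx2(const float* a, const float* b, float* out, size_t n);
void AddScalar(const float* a, const float* b, float* out, size_t n);

using AddFn = void (*)(const float*, const float*, float*, size_t);

// Pick the best implementation the CPU we actually landed on can run.
AddFn ChooseAdd() {
  if (__builtin_cpu_supports("avx512f")) return AddAvx512;
  if (__builtin_cpu_supports("avx2")) return AddAvx2;
  return AddScalar;
}

AddFn GetAdd() {
  static const AddFn fn = ChooseAdd();  // resolved once, on first use
  return fn;
}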
>
> I have not seen any evidence that AVX-512 was disabled because of software necessity or even convenience.
> Multiple people including Linus have said it would be feasible to affinitize-on-fault.
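For flavor, a toy user-space analogue of affinitize-on-fault: trap SIGILL, pin the thread to the big cores, and return, which retries the faulting instruction. The core numbering is an assumption (CPUs 0-7 standing in for P-cores); a real solution would live in the kernel scheduler. Compile with -D_GNU_SOURCE for CPU_SET:

#include <csignal>
#include <sched.h>

static void OnSigill(int) {
  // Assumed: we faulted on an AVX-512 instruction while on an E-core.
  cpu_set_t pcores;
  CPU_ZERO(&pcores);
  for (int cpu = 0; cpu < 8; ++cpu) CPU_SET(cpu, &pcores);  // hypothetical P-core IDs
  sched_setaffinity(0, sizeof(pcores), &pcores);  // pin the calling thread
  // Returning re-executes the faulting instruction, now on a capable core.
}

int main() {
  struct sigaction sa = {};
  sa.sa_handler = OnSigill;
  sigaction(SIGILL, &sa, nullptr);
  // ... code that opportunistically uses AVX-512 ...
}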
>
> So a question: what exactly are the "far-reaching consequences" you are concerned about?
> Is it the extra cost to the scheduler for checking a "don't move me" counter/flag? Does
> that offset 5x energy efficiency gains in perhaps 10% of software (as a modest target)?
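For what it's worth, the check being asked about is presumably on the order of a single compare per migration decision; a toy illustration, with all names hypothetical (this is not actual kernel code):

struct Task {
  int dont_migrate;  // nonzero while the task holds live AVX-512 state
};

bool CanMigrateToECore(const Task& t) {
  return t.dont_migrate == 0;  // one flag test per migration decision
}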
>
> > Let's not forget that only a miniscule fraction of the CPUs ever produced or are
> > likely to be produced in the medium term could benefit from such shenanigans.
> I understand where you're coming from but disagree with this conclusion. Other ISAs also have examples of heavy
> features that might not be feasible/desirable in their equivalent of E-cores, see https://www.realworldtech.com/forum/?threadid=206023&curpostid=206308.
> In particular: AMX and SME for on-device ML. Should those be relegated to servers only? Or never used because
> software only wants to target the lowest common denominator?