By: Mark Roulo (nothanks.delete@this.xxx.com), May 24, 2022 7:41 am
Room: Moderated Discussions
Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on May 24, 2022 4:50 am wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on May 24, 2022 3:48 am wrote:
> > Did you try to measure efficiency gain yourself?
> > For example, AVX-512 vs AVX2 on your own JPEG-XL decode on widely available Tiger Lake or Rocket lake CPU?
> That would be very interesting, I don't have the equipment but would be happy to work with someone who does.
> One related result: slide 11 of https://pdfs.semanticscholar.org/f464/74f6ae2dde68d6ccaf9f537b5277b99a466c.pdf.
>
> > > Unfortunately not that many developers understand yet that SIMD/vectors are
> > > widely useful, not just in ML/cryptography/HPC/image processing niches.
> > I am one of those who don't understand, at least as long as HPC is considered widely.
> > And I consider myself SIMD fan, rather than denier.
> :) How about C++ STL functions, can those be considered widely useful? OK, autovectorization
> will only manage about half of this list (and only for certain types): http://0x80.pl/notesen/2021-01-18-autovectorization-gcc-clang.html
> A couple more are already implemented in https://github.com/google/highway/tree/master/hwy/contrib/algo
> - plus sort() in contrib/sort.
>
> Sereja's book has a couple more nontrivial ones: https://en.algorithmica.org/hpc/algorithms/argmin/
The argument against heterogeneous cores isn't that SIMD isn't useful.
The argument is that the benefit of having AVX2+AVX-512 on some cores and only AVX2 on others is so small that the downsides of dealing with the heterogeneity outweigh it.
So ... your argument that "vectorization is pretty much required for energy efficiency (it amortizes that cost across e.g. 16 lanes). We're talking 5-10x energy reduction here. Surely that is worth pursuing." is pretty much a non-sequitur. No one is arguing against using vector instructions.
And no one here is arguing against vectorizing STL.
What folks ARE arguing against is using vector units that are unavailable on some percentage of the cores in a system.
You need to make the case that wider vector units (e.g. AVX-512) are valuable enough for some fraction of workloads that SOME cores should have them, yet expensive enough (and rarely enough used) that other cores in the same system should NOT have them.
So:
"AVX-512 is great" leads toward: "Well then all the cores should support it!"
"AVX-512 sucks" leads to: "Let's drop AVX-512 and pretend it never happened."
You seem to be saying something like: "It is useful enough to provide, but not so useful that it should be provided on all the cores. And having the developers deal with it is an acceptable solution."
I don't think you've made that case. And arguing in favor of SIMD misses the point.
[I'll note in passing that for the most part the developers WON'T deal with it. They will either code to the lowest common denominator (AVX2) and ignore the extra capability *OR* just peg their app/library/whatever to the cores with all the functionality and ignore the smaller cores. But this is beside the point. You first need to make the case that AVX-512 is valuable enough to provide, but not so valuable that it should be provided on all the cores.]