By: Brendan (btrotter.delete@this.gmail.com), May 24, 2022 5:09 pm
Room: Moderated Discussions
Hi,
Mark Roulo (nothanks.delete@this.xxx.com) on May 24, 2022 7:41 am wrote:
> Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on May 24, 2022 4:50 am wrote:
> > Michael S (already5chosen.delete@this.yahoo.com) on May 24, 2022 3:48 am wrote:
> > > Did you try to measure efficiency gain yourself?
> > > For example, AVX-512 vs AVX2 on your own JPEG-XL decode on widely available Tiger Lake or Rocket lake CPU?
> > That would be very interesting, I don't have the equipment but would be happy to work with someone who does.
> > One related result: slide 11 of https://pdfs.semanticscholar.org/f464/74f6ae2dde68d6ccaf9f537b5277b99a466c.pdf.
> >
> > > > Unfortunately not that many developers understand yet that SIMD/vectors are
> > > > widely useful, not just in ML/cryptography/HPC/image processing niches.
> > > I am one of those who don't understand, at least as long as HPC is considered widely.
> > > And I consider myself SIMD fan, rather than denier.
> > :) How about C++ STL functions, can those be considered widely useful? OK, autovectorization
> > will only manage about half of this list (and only for certain types):
> > http://0x80.pl/notesen/2021-01-18-autovectorization-gcc-clang.html
> > A couple more are already implemented in https://github.com/google/highway/tree/master/hwy/contrib/algo
> > - plus sort() in contrib/sort.
> >
> > Sereja's book has a couple more nontrivial ones: https://en.algorithmica.org/hpc/algorithms/argmin/
>
> The argument against heterogeneous cores isn't that SIMD isn't useful.
>
> The argument is that the benefits of AVX2+AVX-512 on some cores and AVX2 only on some cores
> is so small that the downsides of dealing with the heterogeneity outweigh the benefits.
>
> So ... your argument that "vectorization is pretty much required for energy efficiency (it amortizes
> that cost across e.g. 16 lanes). We're talking 5-10x energy reduction here. Surely that is worth pursuing."
> is pretty much a non-sequitur. No one is arguing against using vector instructions.
>
> And no one here is arguing against vectorizing STL.
>
> What folks ARE arguing against is using vector units that are
> unavailable on some percentage of the cores in a system.
>
> You need to make a case that wider vector units (e.g. AVX-512) are valuable enough
> for a percentage of loads that SOME cores should have them, but expensive enough
> and unusable enough that other cores in the same system should NOT have them.
>
> So:
> "AVX-512 is great" leads towards: "Well then all the cores should support it!"
> "AVX-512 sucks" leads to: "Let's drop AVX-512 and pretend it never happened."
>
> You seem to be saying something like: "It is useful enough to provide, but not so useful to provide
> to all the cores. And having the developers deal with it is an acceptable solution."
>
> I don't think you've made that case. And arguing in favor of SIMD misses the point.
>
> [I'll note in passing that for the most part the developers WON'T deal with it. They will
> either code to the lowest common denominator (AVX2) and ignore the extra capability *OR*
> just peg their app/library/whatever to the core with all the functionality and ignore the
> smaller cores. But this is beside the point. You first need to make the case that AVX-512
> is valuable enough to provide but not so valuable to provide on all the cores.]
For the record, I'm arguing that:
a) even if ISAs are exactly the same there could be up to 10% performance/efficiency improvement, because lots of optimizations (instruction selection and scheduling, which instructions are fused or not, prefetch scheduling distance, whether branch prediction has aliasing issues with "too many branches too close", which cache size for cache blocking optimizations, ...) depend on micro-arch (and P cores and E cores use very different micro-arch, and ARM's "big" cores and "little" cores use very different micro-arch)
b) an "up to 10%" performance/efficiency improvement is significant enough to justify the hassle of heterogeneous CPU support all by itself.
c) once that's done, you also get support for "(slightly) different ISA" with no extra hassle whatsoever; and all the SIMD stuff (e.g. with/without AVX-512) is merely icing on top pushing the potential performance/efficiency improvement even higher.
d) this could've and should've been done 10 years ago; and Intel's "Alder Lake mistakes" are merely a consequence of software developers failing to do it 10 years ago.
I'll also add:
e) failing to do it is going to have more consequences in the future. I can almost guarantee that sooner or later Intel will release a chip where one type of core has "undiscovered at time of release" errata; and we'll be facing a micro-code update that wipes out a random extension (will it be TSX again? SGX? AMX? VT-x?) on all cores and not just the affected cores, with people complaining and refusing to install the micro-code update ("I paid for it and use it, why are you disabling it when it works properly on these cores?"), and with software developers left trying to deal with the fallout.
- Brendan