By: Simon Farnsworth (simon.delete@this.farnz.org.uk), May 23, 2022 3:03 am
Room: Moderated Discussions
Brendan (btrotter.delete@this.gmail.com) on May 23, 2022 12:41 am wrote:
> Hi,
>
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on May 22, 2022 9:49 pm wrote:
> > Brendan (btrotter.delete@this.gmail.com) on May 22, 2022 8:41 pm wrote:
> > >
> > > Libraries? I was mostly talking about normal processes ("generic
> > > app"). For (shared) libraries you're already
> > > in a world of suckage because a compiler can't optimize anything
> > > between caller and callee (even with link-time
> > > optimization),
> >
> > I really get the feeling that you have no idea what you're talking about.
>
> I really get the feeling you have no desire to understand what I'm talking about.
>
> > People use AVX2 for libraries all the time, and there is no fundamental reason
> > why AVX512 would be any different.
>
> In case you've completely missed the entire conversation: we're talking about differences between cores
> in heterogeneous systems that have never existed for 80x86 before (and remain unaddressed for ARM big.LITTLE),
> including "same ISA, different instruction timings, different cache sizes, ...". We have never had a
> situation where AVX2 is supported on some cores but not others in the same system.
>
> > Your "optimize between caller and callee"
> > is a complete red herring and just word salad without any meaning.
>
> My "optimize between caller and callee" is the entire reason every major compiler has adopted link-time
> optimization or link-time code generation (to allow the compiler/linker to optimize even when something
> is in a different compilation unit). There's currently no equivalent for shared libraries (e.g.
> compiling byte-code when software is installed and/or libraries are updated, so that shared
> libraries can be statically linked and whole program optimization is possible).
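
As a concrete illustration of that point (my own sketch, hypothetical file names): compiled separately, the compiler only sees a prototype at the call site; -flto restores the cross-unit view at link time, but only for code actually linked into the binary, not for a callee sitting behind a shared-library boundary.

    /* scale.c: the callee lives in its own compilation unit. */
    void scale(float *dst, const float *src, float k, unsigned n)
    {
        for (unsigned i = 0; i < n; i++)
            dst[i] = src[i] * k;
    }

    /* main.c: without LTO the compiler sees only the prototype below, so it
     * cannot inline or vectorize across the call.  "gcc -O2 -flto scale.c
     * main.c" gives the link step both units, so it can; if scale() instead
     * lived in a shared library, no equivalent step exists today. */
    extern void scale(float *, const float *, float, unsigned);

    int main(void)
    {
        float in[8] = {1, 2, 3, 4, 5, 6, 7, 8}, out[8];
        scale(out, in, 2.0f, 8);
        return (int)out[0];
    }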
>
> > You can use vector extensions entirely inside of libraries, and in fact that is traditionally
> > the common - and almost only - use of them outside of HPC. Vector extensions are used
> > for hashing etc, and when somebody calls various cryptographic functions they often
> > end up using the vector extensions without ever knowing or caring.
>
> Modern compilers auto-vectorize everything they can; and while they could do a better
> job at it, the main problem is that programmers don't let the compiler optimize because
> they want to publish a single "generic, for all 64-bit 80x86" executable.
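
For what it's worth, there is already a middle ground between "one generic executable" and per-ISA builds: function multi-versioning. A rough sketch using GCC's target_clones attribute (my example, not something either of you mentioned):

    /* The compiler emits a baseline x86-64 version plus AVX2 and AVX-512
     * clones of the same auto-vectorized loop inside one binary; an ifunc
     * resolver picks a clone at load time based on what CPUID reports. */
    #include <stddef.h>

    __attribute__((target_clones("default", "avx2", "avx512f")))
    void scale(float *dst, const float *src, float k, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] * k;
    }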
>
> > This is not some theoretical thing, Brendan. This is reality. This is how 99% of all AVX2
> > use is done. Almost nobody uses AVX2 directly using compiler intrinsics, it's all done
> > by calling various library functions that have optimized versions that use AVX2.
> >
> > And no, AVX512 is not different in any real way.
> >
> > Or rather, it is different if it hits the heterogeneous issues we've discussed, and
> > it's different in a bad way. As in "uselessly bad", not "minor little problem".
> >
> > And exactly like Andrey tried to explain to you, the actual library function that gets used tends
> > to be picked either at library load time or at first use, and then it is fixed for the lifetime of
> > the whole process (and fixed across threads). That isn't the only way to do it, no, but it's by
> > far the common one, and it's one of the major uses of the cpuid instruction in modern programming.
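
To make that selection mechanism concrete, here's a minimal sketch of the "pick once at first use" pattern (not glibc's actual code; hash_generic and hash_avx2 are hypothetical stand-ins, and glibc's ifunc machinery does the equivalent at load time instead):

    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t (*hash_fn)(const void *, size_t);

    /* Stand-in implementations; a real library would use AVX2 intrinsics
     * in the second one. */
    static uint64_t hash_generic(const void *buf, size_t len)
    {
        const unsigned char *p = buf;
        uint64_t h = 1469598103934665603ull;                 /* FNV-1a */
        for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 1099511628211ull; }
        return h;
    }

    static uint64_t hash_avx2(const void *buf, size_t len)
    {
        return hash_generic(buf, len);                  /* placeholder body */
    }

    static uint64_t hash_resolve(const void *buf, size_t len);

    /* First call goes through the resolver; after that the choice is fixed
     * for the lifetime of the process, across all threads. */
    static hash_fn hash_impl = hash_resolve;

    static uint64_t hash_resolve(const void *buf, size_t len)
    {
        __builtin_cpu_init();                 /* GCC/Clang CPUID helpers */
        hash_impl = __builtin_cpu_supports("avx2") ? hash_avx2 : hash_generic;
        return hash_impl(buf, len);
    }

    uint64_t hash(const void *buf, size_t len)
    {
        return hash_impl(buf, len);           /* callers never see the check */
    }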
>
> For some reason you think my "libraries keep doing what they do now (using the
> common subset when ISA is different)" is completely different to your "libraries
> keep doing what they do now (using the common subset when ISA is different)"?
>
> > Other somewhat similar models are installation-time optimizations, or run-time JIT generation,
> > but none of those really change the end result in any serious way.
> >
> > So the application - or the programmer - doesn't know,
> > and doesn't care, how the library actually implements
> > whatever crypto function (or memset, or whatever random
> > library function that decided that "hey, avx512 is a
> > good idea for this"). In fact, the program may have been compiled
> > long before new libraries came out that started
> > using new CPU features, so the whole "programmer doesn't even know" is really really fundamental.
> >
> > The whole point of libraries is that they expose interfaces - not implementations.
> >
> > And trust me, just because you do "memset()" does not mean that you want to always run on a P-core.
> > Neither does running some optimized hashing function. And no, that "AVX-512 for crypto" is not
> > some odd made-up example, it is something that Intel talks about in their white-papers.
> >
> > Yet that is literally what you seem to think the solution
> > is, because you don't understand how the world works.
> >
> > And the thing is, if AVX512 isn't usable for random real-world things
> > like cryptography etc, then AVX512 is simply not worth it AT ALL.
> >
> > It really is that simple, and you really are that wrong about libraries.
>
> Heh, no. You're just a patronizing fool struggling with reading comprehension,
> who thinks that alternatives to how things were done in 1960 should never be
> considered regardless of how much hardware and software has changed since.
>
> For AVX-512, the reality is that it's too early for software developers to publish a "compiled for AVX-512
> generic executable for 64-bit 80x86" because they'll lose about 90% of the market; and that this availability/adoption
> problem (which will take about 10 years to resolve, the same as it did for AVX2 and SSE before that) still
> has almost nothing to do with support for dissimilar cores we're talking about.
>
> - Brendan
>
The single biggest user of vector instructions in a dynamic trace of a random binary at work is glibc. That's despite the fact that all our code is compiled with the correct flags for the processor, and we're statically linked with LTO, so your shared-linking screed is irrelevant.
How does your solution work with libc wanting to use AVX512, as it does today?
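
For reference, the choice glibc makes for its string/memory routines comes down to a CPUID probe done once by an ifunc resolver. A rough sketch of just the feature check (not glibc's source, and omitting the XGETBV/OS-support test a real resolver also performs):

    #include <stdio.h>
    #include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */

    int main(void)
    {
        unsigned eax, ebx, ecx, edx;
        /* CPUID leaf 7, subleaf 0: EBX bit 16 reports AVX512F. */
        if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) && (ebx & (1u << 16)))
            puts("AVX512F reported - AVX-512 memset/memcpy variants may be selected");
        else
            puts("no AVX512F - baseline/AVX2 variants will be used");
        return 0;
    }

That probe happens regardless of how the application itself was compiled, which is the point: libc reaches for AVX-512 on its own.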