By: Brendan (btrotter.delete@this.gmail.com), May 23, 2022 9:12 am
Room: Moderated Discussions
Hi,
Simon Farnsworth (simon.delete@this.farnz.org.uk) on May 23, 2022 3:03 am wrote:
> Brendan (btrotter.delete@this.gmail.com) on May 23, 2022 12:41 am wrote:
> > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on May 22, 2022 9:49 pm wrote:
> > > Brendan (btrotter.delete@this.gmail.com) on May 22, 2022 8:41 pm wrote:
> > > >
> > > > Libraries? I was mostly talking about normal processes ("generic
> > > > app"). For (shared) libraries you're already
> > > > in a world of suckage because a compiler can't optimize anything
> > > > between caller and callee (even with link-time
> > > > optimization),
> > >
> > > I really get the feeling that you have no idea what you're talking about.
> >
> > I really get the feeling you have no desire to understand what I'm talking about.
> >
> > > People use AVX2 for libraries all the time, and there is no fundamental reason
> > > why AVX512 would be any different.
> >
> > In case you've completely missed the entire conversation; we're talking about differences between cores
> > in heterogeneous systems that have never existed for 80x86
> > before (and remain unaddressed for ARM big.Little),
> > including "same ISA, different instruction timings, different cache sizes, ...". We have never had a
> > situation where AVX2 is supported on some cores but not others in the same system.
> >
> > > Your "optimize between caller and callee"
> > > is a complete red herring and just word salad without any meaning.
> >
> > My "optimize between caller and callee" is the entire reason every major compiler has adopted link-time
> > optimization or link-time code generation (to allow the compiler/linker to optimize even when something
> > is in a different compilation unit). There's currently no equivalent for shared libraries (e.g.
> > like compiling byte-code when software is installed and/or libraries are updated, so that shared
> > libraries can be statically linked and whole program optimization is possible).
> >
> > > You can use vector extensions entirely inside of libraries, and in fact that is traditionally
> > > the common - and almost only use of them outside of HPC. Vector extensions are used
> > > for hashing etc, and when somebody calls various cryptographic functions they often
> > > end up using the vector extensions without ever knowing or caring.
> >
> > Modern compilers auto-vectorize everything they can; and while they could do a better
> > job at it the main problem is that programmers don't let the compiler optimize because
> > they want to publish a single "generic, for all 64-bit 80x86" executable.
> >
> > > This is not some theoretical thing, Brendan. This is reality. This is how 99% of all AVX2
> > > use is done. Almost nobody uses AVX2 directly using compiler intrinsics, it's all done
> > > by calling various library functions that have optimized versions that use AVX2.
> > >
> > > And no, AVX512 is not different in any real way.
> > >
> > > Or rather, it is different if it hits the heterogeneous issues we've discussed, and
> > > it's different in a bad way. As in "uselessly bad", not "minor little problem".
> > >
> > > And exactly like Andrey tried to explain to you, the actual library function that gets used tends
> > > to be picked either at library load time or at first use, and then it is fixed for the lifetime of
> > > the whole process (and fixed across threads). That isn't the only way to do it, no, but it's by
> > > far the common one, and it's one of the major uses of the cpuid instruction in modern programming.
> >
> > For some reason you think my "libraries keep doing what they do now (using the
> > common subset when ISA is different)" is completely different to your "libraries
> > keep doing what they do now (using the common subset when ISA is different)"?
> >
> > > Other somewhat similar models are installation-time optimizations, or run-time JIT generation,
> > > but none of those really change the end result in any serious way.
> > >
> > > So the application - or the programmer - doesn't know,
> > > and doesn't care, how the library actually implements
> > > whatever crypto function (or memset, or whatever random
> > > library function that decided that "hey, avx512 is a
> > > good idea for this"). In fact, the program may have been compiled
> > > long before new libraries came out that started
> > > using new CPU features, so the whole "programmer doesn't even know" is really really fundamental.
> > >
> > > The whole point of libraries is that they expose interfaces - not implementations.
> > >
> > > And trust me, just because you do "memset()" does not mean that you want to always run on a P-core.
> > > Neither does running some optimized hashing function. And no, that "AVX-512 for crypto" is not
> > > some odd made-up example, it is something that Intel talks about in their white-papers.
> > >
> > > Yet that is literally what you seem to think the solution
> > > is, because you don't understand how the world works.
> > >
> > > And the thing is, if AVX512 isn't usable for random real-world things
> > > like cryptography etc, then AVX512 is simply not worth it AT ALL.
> > >
> > > It really is that simple, and you really are that wrong about libraries.
> >
> > Heh, no. You're just a patronizing fool struggling with reading comprehension,
> > who thinks that alternatives to how things were done in 1960 should never be
> > considered regardless of how much hardware and software has changed since.
> >
> > For AVX-512, the reality is that it's too early for software
> > developers to publish a "compiled for AVX-512
> > generic executable for 64-bit 80x86" because they'll lose about
> > 90% of the market; and that this availability/adoption
> > problem (which will take about 10 years to dissolve, the same as it did for AVX2 and SSE before that) still
> > has almost nothing to do with support for dissimilar cores we're talking about.
>
> The single biggest user of vector instructions in a dynamic trace of a random binary at work
> is glibc. This despite the fact that all our code is compiled with the correct flags for the
> processor, and we're statically linked with LTO so your shared linking screed is irrelevant.
>
> How does your solution work with libc wanting to use AVX512, as it does today?
Today, if a program or a library wants to use code optimized specifically for P cores or E cores (to account for different instruction timings, different cache sizes, and/or different CPU features/extensions), it simply cannot do it at all. For example, if P cores support AVX-512 and E cores don't, then nothing (neither the main program nor any shared library) is able to use AVX-512, even when it's running on a P core anyway.
One solution is to allow software (either the main program only, or the main program and all shared libraries) to temporarily disable thread migration to a different CPU type, determine what kind of CPU it's running on, use the right code for that CPU type, and then re-enable migration to a different CPU type when it's finished. That way software can choose to use code optimized for a P core or an E core when the developer wants (or when the developer decides the improved ability to optimize justifies the overhead). Note that this would be orthogonal to partial/limited specialization at program/library initialization (selecting code that suits both P cores and E cores but isn't optimized for either).
Originally I was also thinking (and also suggested) that "prefer P core", "prefer E core" or "no preference" hints be given to the scheduler when a thread is started, and/or adjusted by the thread after it's started. This is necessary because, in the presence of "core type dispatch", approaches that auto-guess based on a thread's previous history will guess wrong (e.g. assuming a thread should keep running on whatever CPU type it was first given, merely because the thread adapted itself to suit whatever it was given).
- Brendan