By: Simon Farnsworth (simon.delete@this.farnz.org.uk), May 24, 2022 4:18 am
Room: Moderated Discussions
Brendan (btrotter.delete@this.gmail.com) on May 23, 2022 9:12 am wrote:
> Hi,
>
> Simon Farnsworth (simon.delete@this.farnz.org.uk) on May 23, 2022 3:03 am wrote:
> > Brendan (btrotter.delete@this.gmail.com) on May 23, 2022 12:41 am wrote:
> > > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on May 22, 2022 9:49 pm wrote:
> > > > Brendan (btrotter.delete@this.gmail.com) on May 22, 2022 8:41 pm wrote:
> > > > >
> > > > > Libraries? I was mostly talking about normal processes ("generic
> > > > > app"). For (shared) libraries you're already
> > > > > in a world of suckage because a compiler can't optimize anything
> > > > > between caller and callee (even with link-time
> > > > > optimization),
> > > >
> > > > I really get the feeling that you have no idea what you're talking about.
> > >
> > > I really get the feeling you have no desire to understand what I'm talking about.
> > >
> > > > People use AVX2 for libraries all the time, and there is no fundamental reason
> > > > why AVX512 would be any different.
> > >
> > > In case you've completely missed the entire conversation; we're talking about differences between cores
> > > in heterogeneous systems that have never existed for 80x86
> > > before (and remain unaddressed for ARM big.LITTLE),
> > > including "same ISA, different instruction timings, different cache sizes, ...". We have never had a
> > > situation where AVX2 is supported on some cores but not others in the same system.
> > >
> > > > Your "optimize between caller and callee"
> > > > is a complete red herring and just word salad without any meaning.
> > >
> > > My "optimize between caller and callee" is the entire reason every major compiler has adopted link-time
> > > optimization or link-time code generation (to allow the compiler/linker to optimize even when something
> > > is in a different compilation unit). There's currently no equivalent for shared libraries (e.g.
> > > like compiling byte-code when software is installed and/or libraries are updated, so that shared
> > > libraries can be statically linked and whole program optimization is possible).
> > >
> > > > You can use vector extensions entirely inside of libraries, and in fact that is traditionally
> > > > the common - and almost only use of them outside of HPC. Vector extensions are used
> > > > for hashing etc, and when somebody calls various cryptographic functions they often
> > > > end up using the vector extensions without ever knowing or caring.
> > >
> > > Modern compilers auto-vectorize everything they can; and while they could do a better
> > > job of it, the main problem is that programmers don't let the compiler optimize because
> > > they want to publish a single "generic, for all 64-bit 80x86" executable.
> > >
> > > > This is not some theoretical thing, Brendan. This is reality. This is how 99% of all AVX2
> > > > use is done. Almost nobody uses AVX2 directly using compiler intrinsics, it's all done
> > > > by calling various library functions that have optimized versions that use AVX2.
> > > >
> > > > And no, AVX512 is not different in any real way.
> > > >
> > > > Or rather, it is different if it hits the heterogeneous issues we've discussed, and
> > > > it's different in a bad way. As in "uselessly bad", not "minor little problem".
> > > >
> > > > And exactly like Andrey tried to explain to you, the actual library function that gets used tends
> > > > to be picked either at library load time or at first use, and then it is fixed for the lifetime of
> > > > the whole process (and fixed across threads). That isn't the only way to do it, no, but it's by
> > > > far the common one, and it's one of the major uses of the cpuid instruction in modern programming.
> > >
> > > For some reason you think my "libraries keep doing what they do now (using the
> > > common subset when ISA is different)" is completely different to your "libraries
> > > keep doing what they do now (using the common subset when ISA is different)"?
> > >
> > > > Other somewhat similar models are installation-time optimizations, or run-time JIT generation,
> > > > but none of those really change the end result in any serious way.
> > > >
> > > > So the application - or the programmer - doesn't know,
> > > > and doesn't care, how the library actually implements
> > > > whatever crypto function (or memset, or whatever random
> > > > library function that decided that "hey, avx512 is a
> > > > good idea for this"). In fact, the program may have been compiled
> > > > long before new libraries came out that started
> > > > using new CPU features, so the whole "programmer doesn't even know" is really really fundamental.
> > > >
> > > > The whole point of libraries is that they expose interfaces - not implementations.
> > > >
> > > > And trust me, just because you do "memset()" does not mean that you want to always run on a P-core.
> > > > Neither does running some optimized hashing function. And no, that "AVX-512 for crypto" is not
> > > > some odd made-up example, it is something that Intel talks about in their white-papers.
> > > >
> > > > Yet that is literally what you seem to think the solution
> > > > is, because you don't understand how the world works.
> > > >
> > > > And the thing is, if AVX512 isn't usable for random real-world things
> > > > like cryptography etc, then AVX512 is simply not worth it AT ALL.
> > > >
> > > > It really is that simple, and you really are that wrong about libraries.
> > >
> > > Heh, no. You're just a patronizing fool struggling with reading comprehension,
> > > who thinks that alternatives to how things were done in 1960 should never be
> > > considered regardless of how much hardware and software has changed since.
> > >
> > > For AVX-512, the reality is that it's too early for software
> > > developers to publish a "compiled for AVX-512
> > > generic executable for 64-bit 80x86" because they'll lose about
> > > 90% of the market; and that this availability/adoption
> > > problem (which will take about 10 years to dissolve, the same as it did for AVX2 and SSE before that) still
> > > has almost nothing to do with support for dissimilar cores we're talking about.
> >
> > The single biggest user of vector instructions in a dynamic trace of a random binary at work
> > is glibc. This despite the fact that all our code is compiled with the correct flags for the
> > processor, and we're statically linked with LTO so your shared linking screed is irrelevant.
> >
> > How does your solution work with libc wanting to use AVX512, as it does today?
>
> Today; if a program or a library wants to use code optimized specifically for P cores or E cores (including
> different instruction timings, different cache sizes and/or different CPU features/extensions); it simply
> can not do it at all. E.g. if P cores support AVX-512 and E cores don't, then nothing (a main program
> or any shared library) is able to use AVX-512 even when it's running on a P core anyway.
>
> One solution is to allow software (either the main program only, or main program and all shared libraries)
> to temporarily disable thread migration to a different CPU type, then determine what kind of CPU it's running
> on, then use the right code for that CPU type, then re-enable thread migration to a different CPU type when
> it's finished; such that software can choose to use code that is optimized for P core or E core when the
> developer wants (or when the developer decides the improved ability to optimize justifies the overhead).
> Note that this would be orthogonal to partial/limited specialization at program/library initialization (to
> select code that suits both P cores and E cores but isn't optimized for one or the other).
>
How does glibc do this efficiently for small inlined functions like strstr? How does it correctly choose P versus E for its vectorized strstr implementation? Do I, the application developer, have to change my language runtime so that it correctly hints P versus E to glibc?
> Originally I was also thinking (and also suggested) that "prefer P core" or "prefer E core" or "no preference"
> hints be given to the scheduler when the thread is started and/or adjusted by a thread after it's started.
> This is necessary because (in the presence of "core type dispatch") approaches that auto-guess based on
> a thread's previous history will auto-guess wrong (e.g. assume a thread should be run on whatever CPU type
> it was first given merely because the thread adapted itself to suit whatever it was given).
>
It's not hard to do a bad job of supporting heterogeneous ISAs. The hard part is doing a good job - how does glibc know whether it can use the P core only version of strstr, or whether it might be context switched to an E core? How can it efficiently (i.e. no syscalls) prevent context switching to an E core in the middle of using a P core only version of a function? How do you avoid the application and library code disagreeing over which cores are permissible right now, bearing in mind that it's a crash if the application expects P cores, but the library permits the process to be moved to E cores in error? How do you ensure that glibc doesn't switch you to P cores when you're a background task, but also doesn't waste performance using the E core version of a function when you're a foreground task intending to use the P cores?
This is a surprisingly complex area even when you're running different processes on different machines. IBM's solution for Z Series, where money is basically not an object, is that processes are only told about the lowest-common-denominator features of the processors they could run on; to enable what would be the P-core-only features in Intel's terms, you have to confine a process to the capable processors (otherwise the process gets only the common-subset features, even though a subset of the processors could run the more capable code).
I'd love to see a complete design that allows you to have one binary that can be transparently migrated between core types at runtime, but so far, all the solutions I've seen for multiple ISAs involve choosing the ISA at link time (often deferred until dynamic linking as the process starts, but still link time).