By: Brendan (btrotter.delete@this.gmail.com), May 24, 2022 9:53 pm
Room: Moderated Discussions
Hi,
Simon Farnsworth (simon.delete@this.farnz.org.uk) on May 24, 2022 2:14 pm wrote:
> Brendan (btrotter.delete@this.gmail.com) on May 24, 2022 1:44 pm wrote:
> > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on May 23, 2022 4:54 pm wrote:
> > > Brendan (btrotter.delete@this.gmail.com) on May 23, 2022 1:13 pm wrote:
> > > >
> > > > Mental masturbation is things like circular logic - e.g.
> > > > "I don't want to support anything except the common
> > > > case, because the common case is useless, because I didn't want to support anything except the common case".
> > >
> > > It's not about me supporting it.
> > >
> > > The kernel side is fairly trivial. It ranges from "no changes at all" (ie users just do their own
> > > CPU affinity to deal with it) to "minimal changes" (some ELF flag to say "start with this affinity")
> > > to fairly straightforward bigger support (eg "fault-on-use and auto-affine the thread").
> > >
> > > In fact, when I first heard of Intel's heterogeneous model
> > > in Alder Lake, I was like "we can support that easily".
> > >
> > > Because on the kernel side, it really is mostly a non-issue. Any kernel use of AVX512
> > > is already very limited (I think we have a couple of optimized crypto library functions),
> > > and the kernel already obviously supports CPU affinities. It's stupid special-case
> > > code, but it's not necessarily complicated stupid special-case code.
> > >
> > > (Of course, anything to do with the x86 extended FP state is actually fairly complicated to
> > > begin with, because of how it's all oddly lumped together in "xstate" and has about a billion
> > > different variations, so adding more special cases to that code is never a good thing).
> > >
> > > So no. My argument is not at all "I don't want to support it",
> > > and you haven't heard that argument here in this thread.
> > >
> > > My argument is "it's stupid and doesn't work in user space, and any silicon that implements that
> > > heterogeneous model is just wasted space by hardware designers who couldn't do it right".
> > >
> > > Because in practice that heterogeneous model means that 99% of users will never use that AVX512 hardware,
> > > since 99% of users are all in libraries, and I hope I have explained why they would not use it.
> >
> > You have made it clear that supporting it in dynamically linked libraries is currently
> > more important than supporting it in programs and statically linked libraries.
> >
> > But... here's where we're having a problem:
> >
> > You have a habit of assuming that the past (before new technology or new capabilities
> > are introduced) accurately predicts the future (after new technology or new capabilities
> > are introduced, and after the inevitable adoption period).
> >
> > In 2012, you didn't say "0% of software uses SSE2, therefore
> > no software will use SSE2 in future, so supporting
> > SSE2 is silly and doesn't work" (or maybe you did but I
> > doubt it). You understood that introducing something
> > new changes the future; and for 80x86 PCs often it takes
> > about 10 years between the introduction of something
> > new (64-bit 80x86, SSE2, UEFI, Wayland, SystemD, ...) and the end of the adoption period.
> >
> > More specifically; for AVX-512 I think we agree that Intel bungled adoption badly (first making it
> > "HPC only" to ensure its failure because almost nobody has a reason to care, then splitting it into
> > far too many sub-features, then the "rushed" Alder Lake mess just when AVX-512 was starting to gain
> > adoption). Because of this it's like we're currently only 20% of the way into the adoption period,
> > and the statistics you can get today (for how much software of what type uses AVX-512) are relatively
> > worthless (a poor indicator of what "max. adoption" will look like in 10 years time).
> >
> > More specifically; for heterogeneous CPUs there are 3 cases:
> >
> > a) "same ISA, different performance characteristics"; where allowing software to select code to
> > suit the CPU type (with different optimization) is merely a small performance improvement and not
> > strictly required. If support for this was added to the Linux kernel today it'd probably take 5
> > years to get an accurate prediction of how much of what kind of software uses it, and 10 years
> > until you approach "max. adoption". Any statistics you find today are completely irrelevant.
> >
> This is something that exists today, and that the Linux kernel has decent support for
> already (and people are working on making it better). It's called big.LITTLE in the Arm
> world, and is well-supported because it exists for very good technical reasons.
"Support" exists (for power management and load balancing); but it's not like a normal user-space program can tell the kernel "prefer (but don't restrict to) a little core for this work"; and not like programmers can do something like "#pragma CPU_type_preference(little)" in a few places and let the compiler propagate the optimization hints throughout the call graph.
> > b) "slightly different ISA (e.g. with or without AVX512)"; where allowing software to select code
> > to suit the CPU type (with different optimization) is still just a performance improvement (over
> > just using the common subset) and not strictly required, but likely to be a larger performance
> > difference. If support for this was added to the Linux kernel today it'd probably take at least
> > 5 years for hardware vendors to create a system that uses it, then another 5 years to get an accurate
> > prediction of how much of what kind of software uses it; and it'd be 15 or more years until you
> > approach "max. adoption". Any statistics you find today are completely irrelevant.
> >
> > c) "different ISA (e.g. seamless support for a mixture of 80x86 and ARM cores in the same system)";
> > where allowing software to select code to suit the CPU type is a strict requirement. If support
> > for this was added to the Linux kernel today it'd probably be the same (see note) - at least
> > 5 years for hardware vendors to create a system that uses it, and 15 or more years until you
> > approach "max. adoption". Any statistics you find today are completely irrelevant.
> >
> > If you combine both of these (relatively worthless statistics
> > for AVX-512, and completely irrelevant/non-existent
> > statistics for heterogeneous CPUs) you don't end up with anything that can be used for assessing
> > if a proposed change will/won't be useful after the adoption period.
> >
>
> Both the b and c variants already exist in the Arm world and are supported in the Linux kernel; there
> are Arm chips out there where not all cores support AArch32, but all cores support AArch64.
>
> However, the motivation for this support is to let you run old AArch32 binaries on the system, while
> all code that cares about power consumption is written for AArch64 - it's the opposite case to Alder
> Lake, since the intent is that you stop using the bits of the ISA that don't exist on the E cores
> permanently, but for a transition period it should be possible to still use the legacy ISA.
Oh, so at some point in the future we can just say "software for CPUs without AVX-512 is legacy software" to change the motivation for it, and limited support for heterogeneous CPUs (pinning older software that doesn't use AVX-512 to E cores) becomes good?
> > Essentially; when you say something like "Because in practice that heterogeneous model (that
> > isn't supported today) means that 99% of users will never (in 15+ years time, after it's made
> > its way through kernel support to compiler/tools support to normal applications and then reaches
> > "max. adoption") use that AVX512 hardware, since 99% of users (today and not in 15+ years
> > time) are all in libraries" the only thing it does is make me think you're stupid.
> >
> He's not talking about AVX512 users when he says 99% of users are in libraries - he's
> talking about AVX/AVX2 (the stuff the E core supports), and saying that for AVX512 to
> be worth bothering with, it needs to be usable in all the places AVX2 is used today.
Perhaps; but that only dilutes my objection without invalidating it. There'd have to be some programmers for whom "some performance improvement" (from AVX2) doesn't justify the hassle but "more performance improvement" (from AVX-512) would; and some programmers who'd benefit from things AVX2 doesn't provide (mask registers, half-precision floats).
Of course the majority of user-space is written by people who don't care much about performance in the first place, plus some people who push SIMD code out of their programs into libraries to make porting easier; both of which skew the stats towards the "99% in libraries".
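On the mask-register point: the thing AVX-512 adds that AVX2 lacks is per-lane predication built into the instruction itself, where AVX2 needs a separate compare-and-blend. A pure-Python scalar model of the semantics (not real intrinsics) of a merge-masked add:

```python
# Scalar model of what a single AVX-512 masked operation does: per-lane
# predication via a mask register (merge-masking). With AVX2 you would
# emulate this with a compare plus a separate blend instruction.
def mask_add(mask, src, a, b):
    """Lane i gets a[i] + b[i] if mask bit i is set; otherwise the
    untouched src[i] is kept (merge-masking semantics)."""
    return [x + y if (mask >> i) & 1 else s
            for i, (s, x, y) in enumerate(zip(src, a, b))]

# mask 0b0101 -> lanes 0 and 2 updated, lanes 1 and 3 kept from src
print(mask_add(0b0101, [9, 9, 9, 9], [1, 2, 3, 4], [10, 20, 30, 40]))
# -> [11, 9, 33, 9]
```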
> > > And that "99% of users wouldn't use it at all" is for a feature that already doesn't have very many users
> > > to begin with, because it's already fairly specialized. Compiler people think auto-vectorization is common
> > > and a big deal. Outside of very special cases it's neither.
> > > So a questionably useful feature thus becomes completely
> > > useless because you realistically can't use it in the one situation where it's most useful.
> > >
> > > I'd much rather have Intel give people more cache, more cores, or higher frequencies
> > > than give me a terminally broken heterogeneous AVX512 system.
> >
> > For this you won't have to worry - Intel can't do "broken heterogeneous AVX512" because Windows
> > is no better at "same ISA, different performance characteristics" (which I consider a necessary
> > first step towards "slightly different ISA") than Linux. It'll be something else (e.g. "broken
> > heterogeneous AVX-1024" or "broken heterogeneous SVE3") that you'll need to worry about.
> >
> There's a simpler reason why it makes no sense - it's not that difficult in hardware to use a narrow
> vector ALU (128 bit, say) and multiple clock cycles to do wide operations. You don't perform as well
> as if you use a wider ALU, but you can build an AVX512 execution engine that uses a 128 bit vector
> ALU (and thus inherently can't outperform AVX2), and for an E core, this is good enough.
To be honest; I think Alder Lake is mostly backwards. You'd want P cores for sequential/non-parallelizable work where you don't have much reason to care about AVX-512, and E cores with AVX-512 for doing parallelizable work efficiently (at lower clock frequency, etc).
Using a 256-bit ALU for AVX-512 on E cores doesn't help much if the entire point of having E cores is to do more in parallel per cycle at a lower frequency.
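A back-of-envelope calculation makes both halves of this argument explicit. All the core counts, ALU widths, and frequencies below are illustrative assumptions, not real Alder Lake figures:

```python
# Back-of-envelope aggregate SIMD throughput under made-up numbers.
# Two claims from the text: (a) an E core that double-pumps AVX-512
# through a 256-bit ALU does no more work per cycle than AVX2, and
# (b) many slow-but-genuinely-wide E cores can still beat a few fast
# P cores on total throughput.
def simd_throughput(cores, alu_bits, ghz):
    """Bits of vector work per nanosecond, assuming one ALU op/cycle."""
    return cores * alu_bits * ghz

# 8 P cores with native 512-bit ALUs at 5.0 GHz
p_cores = simd_throughput(cores=8, alu_bits=512, ghz=5.0)

# 16 E cores with 256-bit ALUs at 3.8 GHz: same bits/cycle whether they
# run AVX2 natively or AVX-512 double-pumped through the narrow ALU
e_narrow = simd_throughput(cores=16, alu_bits=256, ghz=3.8)

# The hypothetical design argued for above: E cores with true 512-bit ALUs
e_wide = simd_throughput(cores=16, alu_bits=512, ghz=3.8)

print(p_cores, e_narrow, e_wide)
```

Under these made-up numbers the wide-ALU E-core cluster out-throughputs the P cores despite the lower clock, while the double-pumped design never exceeds what the same silicon delivers as AVX2.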
> You're far more likely to see a world where (e.g.) the E cores drop x87 (and maybe even
> MMX), than one in which there's a performance boost on the P cores if you use instructions
> not available on the E cores. And Linux already knows how to handle that.
It'd only take (e.g.) a 5th generation Xbox having custom "heterogeneous Zen 4" chips. Once Windows supports heterogeneous CPUs (for Xbox games), Windows supports heterogeneous CPUs (for everything), and everyone else (Linux) gets stuck with it whether they like it or not. It certainly wouldn't be the first time a game console did something unusual (PlayStation 3).
> Heterogeneous ISA where the E cores don't support deprecated instructions is fine - indeed, it's
> a good transition strategy. Heterogeneous cores where the new shiny that everyone is "supposed"
> to use for peak performance is not supported on the E cores is not; in the real world, even with
> statically linked binaries and an ability to recompile the world to support it, the effect of a
> heterogeneous ISA is that people don't use the bits of the ISA that aren't present on all cores.
Erm, no? In the real world, (some) high performance software developers use GPGPU despite the fact that it has the worst and most hacky "programmer/user experience" you could possibly imagine; and nobody says "Oh, let's just use the parts of the ISAs that are common between CPU and GPU", because there are no common parts.
Do you think this would suddenly change if it was (e.g.) eight 80x86 P cores with AVX-512 and sixteen 80x86 E cores with AVX-2048 (and there were common parts of the ISA)?
- Brendan