By: Simon Farnsworth (simon.delete@this.farnz.org.uk), May 23, 2022 4:16 am
Room: Moderated Discussions
⚛ (0xe2.0x9a.0x9b.delete@this.gmail.com) on May 22, 2022 11:51 am wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on May 21, 2022 4:58 pm wrote:
> > That second case requires a working and reliable CPUID bit that doesn't cause the
> > code to either go ridiculously slowly (emulation) or get relegated to just a subset
> > of the cores in the system (trap-and-migrate or explicit affinities).
>
> I don't want to write a reaction to the whole heterogeneous-x86-cores discussion, because it is obvious that
> you and I are deeply in disagreement. Instead, I would like to briefly mention the following argument:
>
> Binary translation isn't "ridiculously slow". People who claim that emulation is slow are most
> likely thinking about a basic emulation algorithm/method without any code translation caches.
>
> Considering the fact that you worked for Transmeta, I fail to understand why
> you claim that emulation is "ridiculously slow". It isn't. (I presume that "ridiculously
> slow" means "2 times slower or worse" or something like that.)
>
> The Linux kernel is what it is: there is no "advanced native" support for binary translation in the Linux
> kernel. If the kernel already supported it then it would be easier to run heterogeneous x86 apps in Linux
> because an AVX-512 app would be able to run on Alder Lake E-cores (with a reasonable performance penalty,
> and in case the performance penalty was measured - by the kernel - to be unreasonable then the kernel
> would try to pin the app to Alder Lake's P-cores). If there are idle P-cores available and the CPU performance
> governor isn't set to powersave, there is little need to run a process on an E-core.
>
> You are overly protective of what the Linux kernel currently is. There is no vision of a future
> of heterogeneous CPUs in your posts .... if heterogeneous desktop/notebook CPUs are inevitable
> then you should have a plan for it or make a plan for it. (An example reason why heterogeneous
> CPUs are inevitable in those markets is that endowing _all_ cores in a future desktop machine
> with the ability to predict 4 branches per cycle would be problematic.)
>
> -atom
Heterogeneous cores with the same ISA are something the kernel works with today, but it's still an area of study - how do you make the kernel make the right decisions about which cores to use and when, given per-core energy efficiency and performance details, and given that a core's performance depends not just on the core itself, but also on the decisions you make for the other cores?
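To make the shape of that decision problem concrete, here's a toy sketch in C - this is not the kernel's actual energy-aware scheduler, and the struct, helpers and numbers are all invented:

/* Pick the CPU with the lowest marginal energy cost that can still
 * meet the task's utilisation. Real placement is harder: loading one
 * CPU can raise the shared frequency/voltage domain for its siblings,
 * so per-core costs aren't independent - this toy model deliberately
 * ignores that coupling. */
struct cpu {
    unsigned int cap;        /* capacity at max frequency (> 0) */
    unsigned int util;       /* current utilisation, <= cap */
    unsigned int max_power;  /* mW when fully utilised */
};

static unsigned int energy(const struct cpu *c, unsigned int extra)
{
    unsigned int u = c->util + extra;
    if (u > c->cap)
        u = c->cap;
    return (unsigned int)((unsigned long long)c->max_power * u / c->cap);
}

static int pick_cpu(const struct cpu *cpus, int n, unsigned int task_util)
{
    int best = -1;
    unsigned int best_cost = ~0u;

    for (int i = 0; i < n; i++) {
        if (cpus[i].util + task_util > cpus[i].cap)
            continue;        /* this core can't deliver the performance */
        unsigned int cost = energy(&cpus[i], task_util) - energy(&cpus[i], 0);
        if (cost < best_cost) {
            best_cost = cost;
            best = i;
        }
    }
    return best;             /* -1: nothing fits, caller falls back to the biggest core */
}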
It is, however, good enough to let you mix tiny cores that max out at 1 IPC given optimal code with massive OoOE cores capable of sustaining 8 IPC even in the face of branches, assuming that's sensible in the hardware. Where a process does heavy crunching, it'll end up on the massive OoOE core; where it wakes up, runs a few thousand instructions, then blocks in the kernel again, it'll run on the tiny core.
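To give a flavour of that heuristic (again a toy in C - the names and the 50us threshold are made up, and the kernel's real load tracking is far more involved): track how long a task runs between blocking, and let long runs earn the big core.

#include <stdint.h>

struct task_stats {
    uint64_t avg_run_ns;     /* moving average of runtime between blocks */
};

/* Called when the task blocks in the kernel after running for ran_ns. */
static void account_run(struct task_stats *t, uint64_t ran_ns)
{
    /* simple exponentially weighted moving average, weight 1/8 */
    t->avg_run_ns = (7 * t->avg_run_ns + ran_ns) / 8;
}

/* A few thousand instructions is a microsecond or two; anything that
 * routinely runs much longer than that counts as heavy crunching and
 * is worth migrating to the big OoOE core. The cut-off is made up. */
static int wants_big_core(const struct task_stats *t)
{
    return t->avg_run_ns > 50 * 1000;    /* 50 us */
}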
Adding heterogeneous ISA to the mix adds another layer of complexity to an already unsolved problem - and it's not clear that this is the direction hardware will take. For example, with SVE2 it's not clear that the logic to handle a change in vector length on context switch would be cheaper than having every core expose the same vector length, with the "efficient" cores handling long vectors by spending more clock cycles than the "fast" cores - e.g. executing a 2048-bit vector operation as 16 passes through a 128-bit vector ALU.
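For context on why a single exposed vector length is software-transparent, here's a minimal vector-length-agnostic SVE loop using the standard ACLE intrinsics (the function itself is my invention; build with something like -march=armv8-a+sve). The binary never hard-codes a width - it asks the hardware each iteration - so whether a 2048-bit operation retires in one pass or as 16 passes through a 128-bit ALU is invisible to it; it's only changing the reported vector length under a running thread that needs the extra context-switch logic.

#include <arm_sve.h>
#include <stdint.h>

/* dst[i] = a[i] + b[i], written against whatever vector length the
 * hardware reports; a narrow implementation just takes more trips
 * around the loop than a wide one. */
void vla_add(float *dst, const float *a, const float *b, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);       /* predicate masks the tail */
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, dst + i, svadd_f32_x(pg, va, vb));
    }
}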