By: Brendan (btrotter.delete@this.gmail.com), May 21, 2022 12:58 pm
Room: Moderated Discussions
Hi,
Andrey (andrey.semashev.delete@this.gmail.com) on May 21, 2022 11:30 am wrote:
> Brendan (btrotter.delete@this.gmail.com) on May 21, 2022 10:36 am wrote:
> > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on May 21, 2022 12:06 am wrote:
> >
> > How about we create an OS where unsupported instructions are emulated; so that you can run a new executable
> > that uses AVX-512 on a crusty old Prescott Pentium 4 from
> > 2005 without any problem (other than performance);
> > and where a "some cores don't support AVX-512" problem becomes a performance issue (a minor addition to
> > the "P cores are faster than E cores" performance issue you already have to deal with)?
>
> No need to write an OS. There is already SDE that you can test with. Yes, it works in principle,
> but performance is a real issue. No, not a minor issue, but one that can easily make the emulated
> software unusable in the practical sense. Or put it another way, you would have been better
> off not having AVX-512 in the first place instead of trying to emulate it.
SDE is a heavyweight tool designed for things like debugging and instrumentation (not performance) that, if I understand it properly, injects code into the original program rather than relying on a kernel's invalid opcode exception handler. This makes it unrepresentative of the performance you'd expect, and also makes it unable to work properly in some cases (e.g. threads in the same process running on different CPUs, where AVX-512 emulation would be injected even into threads running on CPUs that support AVX-512 natively).
As a crude guess, I'd expect the invalid opcode exception handler approach to be around 30 times slower for "very AVX-512 heavy" code and maybe negligibly slower for "very infrequent AVX-512 use" (compared to running on a CPU that supports it), and "infinitely faster" compared to not executing any code at all because the P cores are busy while the E cores sit idle.
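To make the shape of that guess concrete, here is a minimal sketch of a linear cost model. It is an illustration under stated assumptions, not a measurement: `emulated_slowdown`, `f` and `k` are hypothetical names, and reading the "around 30 times" figure as the per-instruction emulation cost is an interpretation of the guess above.

```c
/*
 * Crude linear model of trap-and-emulate overhead.
 *
 *   f = fraction of native run time spent in AVX-512 instructions (0.0 .. 1.0)
 *   k = how many times slower one emulated instruction is than the native one
 *
 * Both inputs are hypothetical; the result is the overall slowdown relative
 * to running the same code on a CPU with native AVX-512.
 */
static double emulated_slowdown(double f, double k)
{
    /* Non-AVX-512 time is unchanged; AVX-512 time is scaled by k. */
    return (1.0 - f) + f * k;
}
```

With k = 30, code that is all AVX-512 (f = 1.0) comes out 30 times slower, while code that spends 0.1% of its time in AVX-512 (f = 0.001) comes out roughly 3% slower, which matches the "heavy" versus "very infrequent" ends of the guess.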
> > > No. The question was much more fundamental: 'what does cpuid report?'.
> > >
> > > And that question simply has no valid useful answer in the heterogeneous system.
> > >
> > > Ergo: the heterogeneous model is broken. Fundamentally and unfixably so.
> >
> > Nonsense. E.g. Intel could easily say "old CPUID leaves (with bit 15 of EAX clear) report information
> > that's compatible with all cores; but (for CPUs that have mixed cores) here's a whole new set of CPUID
> > leaves (with bit 15 of EAX set) that report information that only applies to this specific CPU" so that
> > old software that checks CPUID works and new software that supports different CPUs can work better.
>
> You're missing the point. Information that is about the current core is stale (i.e. useless)
> because your thread might have been moved to another core right after cpuid completed.
No. New software (designed to use the new CPUID leaves) would be aware of that problem and would avoid it, e.g. by using something like "sched_setaffinity()" to lock the thread to a specific CPU type before using CPUID (and maybe using "sched_setaffinity()" again later to restore the original CPU affinity and allow migration again).
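A rough sketch of that pin, query, restore pattern is below. It assumes Linux on x86 with GCC or Clang (for `<cpuid.h>` and `__get_cpuid_count()`). The helper names `cpuid_on_cpu` and `cpu_has_avx512f` are made up for illustration. Because the per-core leaves proposed above don't exist, the existing leaf 7 (where EBX bit 16 reports AVX512F) stands in for them. For simplicity it pins to a single CPU number rather than to every CPU of one type.

```c
#define _GNU_SOURCE
#include <cpuid.h>   /* GCC/Clang helper: __get_cpuid_count() */
#include <sched.h>   /* sched_getaffinity(), sched_setaffinity(), CPU_* macros */

/*
 * Pin the calling thread to `cpu`, execute CPUID there, then restore the
 * original affinity mask. regs[] receives EAX, EBX, ECX, EDX.
 * Returns 0 on success, -1 on failure or if the leaf is not supported.
 */
static int cpuid_on_cpu(int cpu, unsigned leaf, unsigned subleaf,
                        unsigned regs[4])
{
    cpu_set_t saved, pinned;

    /* Remember the current mask so it can be restored afterwards. */
    if (sched_getaffinity(0, sizeof(saved), &saved) != 0)
        return -1;

    CPU_ZERO(&pinned);
    CPU_SET(cpu, &pinned);
    if (sched_setaffinity(0, sizeof(pinned), &pinned) != 0)
        return -1;

    /* Until the mask is restored, the thread cannot migrate off `cpu`. */
    int supported = __get_cpuid_count(leaf, subleaf,
                                      &regs[0], &regs[1], &regs[2], &regs[3]);

    /* Allow migration again. */
    if (sched_setaffinity(0, sizeof(saved), &saved) != 0)
        return -1;

    return supported ? 0 : -1;
}

/*
 * Stand-in query using the existing leaf 7, subleaf 0 (EBX bit 16 = AVX512F).
 * Returns 1 if the bit is set, 0 if clear, -1 on failure.
 */
static int cpu_has_avx512f(int cpu)
{
    unsigned regs[4];

    if (cpuid_on_cpu(cpu, 7, 0, regs) != 0)
        return -1;
    return (int)((regs[1] >> 16) & 1u);
}
```

The answer is only valid for the CPU that was pinned, so code that acts on it (e.g. choosing an AVX-512 code path) would need to keep the thread restricted to CPUs of that type for as long as it runs that path.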
- Brendan