By: anon2 (anon.delete@this.anon.com), May 21, 2022 1:45 am
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on May 21, 2022 12:06 am wrote:
> Brendan (btrotter.delete@this.gmail.com) on May 20, 2022 8:42 pm wrote:
> >
> > Just because a thread got migrated to a P core doesn't mean it has to stay there - you could migrate
> > a thread back to the E core for a while (until it uses the library again) if you want.
>
> That's the "hey, this can be fixed" thing.
>
> But the unfixable thing is much more fundamental: 'cpuid' is suddenly not reliable or meaningful.
>
> Basically, what does 'cpuid' mean?
>
> Does it mean "what are the capabilities of the CPU I happen to be running on right now"? But
> then it's useless in any system where the load can be migrated to another CPU at any time.
>
> Or does 'cpuid' mean "Ok, I now give you a set of capabilities that I guarantee"? So now any
> process that has ever run 'cpuid' will be tied to a core that matches what it was told?
Have all cores, including the small ones, report that AVX-512 is available. When a thread on a small core tries to execute an AVX-512 instruction, the core traps and the OS can move the thread to a big core and resume it there. Threads don't have to stay stuck there: you can use heuristics (a time interval, or a look at recent AVX-512 activity) to let them move back to a small core.
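To make that concrete, here is a rough sketch of the idea as a toy simulation in Python. It is not how any real kernel does it; the names (`Thread`, `execute`, `maybe_demote`, `idle_ticks`) are made up for illustration, and a "tick" just stands for one step of the instruction trace.

```python
from dataclasses import dataclass

BIG, SMALL = "big", "small"


@dataclass
class Thread:
    core: str = SMALL
    last_avx512_tick: int = -1  # tick of the most recent AVX-512 instruction
    migrations: int = 0


def execute(thread, uses_avx512, tick):
    """Run one instruction; trap and migrate if a small core hits AVX-512."""
    if uses_avx512:
        if thread.core == SMALL:
            # Hypothetical trap handler: the small core faults, the OS moves
            # the thread to a big core and restarts the instruction there.
            thread.core = BIG
            thread.migrations += 1
        thread.last_avx512_tick = tick


def maybe_demote(thread, tick, idle_ticks=3):
    """Heuristic: allow a move back after `idle_ticks` without AVX-512."""
    if thread.core == BIG and tick - thread.last_avx512_tick >= idle_ticks:
        thread.core = SMALL
        thread.migrations += 1


def run(trace, idle_ticks=3):
    """Replay a trace of booleans (True = AVX-512 instruction)."""
    thread = Thread()
    placement = []
    for tick, uses_avx512 in enumerate(trace):
        execute(thread, uses_avx512, tick)
        placement.append(thread.core)  # core the instruction ran on
        maybe_demote(thread, tick, idle_ticks)
    return placement, thread.migrations
```

Calling `run([False, True, False, False, False, False, True])` shows the thread getting promoted on the first AVX-512 instruction, demoted after a quiet stretch, and promoted again on the next one. The demotion threshold is the knob that decides how sticky the big core is.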
> In other words, you are looking for an engineering solution to "oh, this core doesn't do instruction
> XYZ", but you are missing the much more fundamental issue. Intel by design has very much exposed
> that whole "query what the CPU supports" thing as a native instruction, and you cannot make
> that instruction work with sane semantics in a heterogeneous system.
>
> And 'cpuid' is not some small implementation detail. It's literally what any core system
> library would use to decide "How do I choose to implement functionality XYZ?". So 'cpuid' has
> to work, and it has to be reliable and meaningful, because it's literally how people will
> make the decision on whether to use the AVX512 version of the library or not.
>
> If you claim you do AVX512, then all processes end up getting pinned to
> big cores just because some random library goes "oh, then I'll use it".
>
> And if you claim not to do AVX512, then people won't be using
> it at all, and you would be better off not having it.
>
> And if you randomly return a value based on "right now you happen to be running on CPU X, so you do or do not
> have AVX512 based on that", you end up with random performance and the worst of both of the above worlds.
>
> And that is all assuming that the system software bent over backwards to make the whole thing work
> with auto-migration in the first place, so all these bad outcomes actually require a fair amount of
> engineering to even work at all (ok, except for the "never report AVX512" case, of course).
>
> End result: you can't win.
>
> So you're answering the wrong question entirely. The question was never "Can I auto-migrate
> a process that uses AVX512 to a big core that supports it, and maybe auto-demote it later?"
>
> No. The question was much more fundamental: 'what does cpuid report?'.
>
> And that question simply has no valid useful answer in the heterogeneous system.
>
> Ergo: the heterogeneous model is broken. Fundamentally and unfixably so.
>
> (And as always: in embedded systems you can do anything you want, since you control the horizontal
> and you control the vertical. And so you can just keep big and small cores separate and never
> migrate things at all, or only do it for loads that explicitly have asked for it)
>
> Linus
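For reference, the three possible 'cpuid' answers discussed in the quoted post can be laid out as a small Python model. This is only a sketch under my own assumptions: the policy names ("always", "never", "per-core") and the helper functions are hypothetical, and the library is assumed to query once at load time and cache its choice.

```python
def select_kernel(cpuid_reports_avx512):
    """Pick an implementation once, the way a library might at load time."""
    return "avx512" if cpuid_reports_avx512 else "generic"


def outcome(policy, core_at_query, core_at_call):
    """Return (kernel, result) when the cached choice is used later.

    policy is "always", "never", or "per-core"; these are illustrative
    names for the three cpuid answers, not anything a real CPU exposes.
    """
    has_avx512 = {"big": True, "small": False}
    if policy == "always":
        reported = True
    elif policy == "never":
        reported = False
    else:  # "per-core": the answer depends on where the query happened to run
        reported = has_avx512[core_at_query]

    kernel = select_kernel(reported)
    if kernel == "avx512" and not has_avx512[core_at_call]:
        return kernel, "trap: needs migration to a big core"
    if kernel == "generic" and has_avx512[core_at_call]:
        return kernel, "runs, AVX-512 unit unused"
    return kernel, "runs natively"
```

With "always", a call on a small core takes the trap-and-migrate path from my suggestion above. With "per-core", the result depends on where the thread happened to be when the library asked, which is the random behaviour the quoted post objects to.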