By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), May 20, 2022 11:06 pm
Room: Moderated Discussions
Brendan (btrotter.delete@this.gmail.com) on May 20, 2022 8:42 pm wrote:
>
> Just because a thread got migrated to a P core doesn't mean it has to stay there - you could migrate
> a thread back to the E core for a while (until it uses the library again) if you want.
That's the "hey, this can be fixed" thing.
But the unfixable thing is much more fundamental: 'cpuid' is suddenly not reliable or meaningful.
Basically, what does 'cpuid' mean?
Does it mean "what are the capabilities of the CPU I happen to be running on right now"? But then it's useless in any system where the load can be migrated to another CPU at any time.
Or does 'cpuid' mean "Ok, I now give you a set of capabilities that I guarantee"? So now any process that has ever run 'cpuid' will be tied to a core that matches what it was told?
In other words, you are looking for an engineering solution to "oh, this core doesn't do instruction XYZ", but you are missing the much more fundamental issue. Intel by design has very much exposed that whole "query what the CPU supports" thing as a native instruction, and you cannot make that instruction work with sane semantics in a heterogeneous system.
And 'cpuid' is not some small implementation detail. It's literally what any core system library would use to decide "How do I choose to implement functionality XYZ?". So 'cpuid' has to work, and it has to be reliable and meaningful, because it's literally how people will make the decision on whether to use the AVX512 version of the library or not.
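To make the dispatch pattern concrete, here is a toy sketch of a library that queries capabilities once at load time and caches the decision. It is a simulation in Python, not code from any real library: `P_CORE`, `E_CORE`, `DispatchedSum` and `IllegalInstruction` are all hypothetical names, and the `cpuid` argument is just a callable standing in for the real instruction.

```python
from typing import Callable, FrozenSet, List

# Hypothetical feature sets for the two core types (assumption for illustration).
P_CORE: FrozenSet[str] = frozenset({"sse2", "avx2", "avx512f"})
E_CORE: FrozenSet[str] = frozenset({"sse2", "avx2"})


class IllegalInstruction(Exception):
    """Stands in for the fault raised when a core hits an unsupported instruction."""


class DispatchedSum:
    """Toy library routine that picks its implementation once, at load time."""

    def __init__(self, cpuid: Callable[[], FrozenSet[str]]) -> None:
        # The answer is cached here; the library never asks again.
        self.use_avx512 = "avx512f" in cpuid()

    def __call__(self, values: List[int], current_core: FrozenSet[str]) -> int:
        if self.use_avx512 and "avx512f" not in current_core:
            # The cached answer no longer matches the core we are running on.
            raise IllegalInstruction("AVX512 path executed on a core without AVX512")
        # Both paths compute the same result; only the instructions would differ.
        return sum(values)


lib = DispatchedSum(lambda: P_CORE)   # library loaded while on a P core
print(lib([1, 2, 3], P_CORE))         # works while still on a P core
try:
    lib([1, 2, 3], E_CORE)            # same process, later scheduled on an E core
except IllegalInstruction as exc:
    print("fault:", exc)
```

The point of the sketch is only that the decision outlives the query: whatever 'cpuid' said at load time is what the library keeps acting on, wherever the thread runs afterwards.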
If you claim you do AVX512, then all processes end up getting pinned to big cores just because some random library goes "oh, then I'll use it".
And if you claim not to do AVX512, then people won't be using it at all, and you would be better off not having it.
And if you randomly return a value based on "right now you happen to be running on CPU X, so you do or do not have AVX512 based on that", you end up with random performance and the worst of both of the above worlds.
And that is all assuming that the system software bent over backwards to make the whole thing work with auto-migration in the first place, so all these bad outcomes actually require a fair amount of engineering to even work at all (ok, except for the "never report AVX512" case, of course).
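The three reporting policies above can be put side by side in a small model. This is a hedged sketch under stated assumptions: the core names, the `CORES` table and the policy strings are invented for illustration, and the model assumes the system software already handles migration so that a process only ever runs on cores that support what it uses.

```python
from typing import Dict, List, Tuple

# Hypothetical machine: core name -> whether that core implements AVX512.
CORES: Dict[str, bool] = {
    "P0": True, "P1": True,
    "E0": False, "E1": False, "E2": False, "E3": False,
}


def reported_avx512(policy: str, core: str) -> bool:
    """What 'cpuid' tells the process under each reporting policy."""
    if policy == "always":
        return True
    if policy == "never":
        return False
    if policy == "current-core":
        return CORES[core]
    raise ValueError(f"unknown policy: {policy}")


def outcome(policy: str, core_at_query: str) -> Tuple[bool, List[str]]:
    """Return (library uses AVX512?, cores the process can still run on)."""
    uses = reported_avx512(policy, core_at_query)
    # A process using AVX512 is restricted to cores that implement it.
    usable = [name for name, has in CORES.items() if has or not uses]
    return uses, usable


for policy in ("always", "never", "current-core"):
    for start in ("P0", "E0"):
        uses, usable = outcome(policy, start)
        print(f"{policy:13s} queried on {start}: uses AVX512={uses}, can run on {usable}")
```

Under "always" every process that dispatches on the answer ends up restricted to the P cores, under "never" the AVX512 hardware goes unused, and under "current-core" the result depends only on where the process happened to be when it asked.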
End result: you can't win.
So you're answering the wrong question entirely. The question was never "Can I auto-migrate a process that uses AVX512 to a big core that supports it, and maybe auto-demote it later?"
No. The question was much more fundamental: 'what does cpuid report?'.
And that question simply has no valid useful answer in the heterogeneous system.
Ergo: the heterogeneous model is broken. Fundamentally and unfixably so.
(And as always: in embedded systems you can do anything you want, since you control the horizontal and you control the vertical. And so you can just keep big and small cores separate and never migrate things at all, or only do it for loads that explicitly have asked for it)
Linus