By: Björn Ragnar Björnsson (bjorn.ragnar.delete@this.gmail.com), May 23, 2022 5:41 pm
Room: Moderated Discussions
Brendan (btrotter.delete@this.gmail.com) on May 23, 2022 1:51 am wrote:
> Hi,
>
> Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on May 22, 2022 11:11 pm wrote:
> > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on May 22, 2022 9:49 pm wrote:
> > > And exactly like Andrey tried to explain to you, the actual library function that gets used tends
> > > to be picked either at library load time or at first use, and then it is fixed for the lifetime of
> > > the whole process (and fixed across threads). That isn't the only way to do it, no, but it's by far
> > > the common one, and it's one of the major uses of the cpuid instruction in modern programming.
> > I agree this is how it's currently done (including by us).
> > But given that we are moving into a world with more
> > heterogeneity (whether we like it or not), isn't that an occasion to perhaps adapt what we've been doing?
> >
> > Here is a proposal for lightweight but adaptable dispatch that seems reasonable, do you see any major flaws?
> > 1) At startup the library runs existing (expensive) CPUID-based checks once per (enabled)
> > logical processor, and stores an array of "which features can I use" bitfields.
> > 2) The OS provides a lightweight "don't migrate me" flag, perhaps
> > even mapped into user space to avoid kernel entry.
> > 3) On entering the library, set the no_migrate flag, get the current logical processor index,
> > use it to look up our bitfield, ctz() on that to get the index of the function pointer to
> > call from our pre-baked table. After we're done with SIMD, reset the no_migrate.
>
> Minor alterations:
>
> a) I'd still prefer a "migration disables counter" for nesting (increment when entering
> library, decrement when leaving library; in case the caller already disabled migration
> before calling the library). This could be in thread local storage.
>
> b) Kernel can maintain a "current CPU type" (set when thread created, updated when thread migrated to a different
> CPU type) in a thread's thread local storage (in same cache line as a thread's "migration disables counter")
> so that you don't need "once per (enabled) logical processor" ("once per CPU type" is enough).
>
> For a more invasive approach; OS could "pre-initialize" libraries (when they're updated or CPUs are replaced)
> and store the result (as "cached pre-initialized library files"); and give each process 2 virtual address
> spaces that are almost identical except for which version of the pre-initialized library is used; where
> scheduler changes a thread's virtual address space whenever it migrates the thread to a different CPU
> type. This would avoid all of the startup overhead and most of the dispatch overhead (even for the "homogenous
> CPUs" case where you'd only have 1 virtual address space per process).
>
> - Brendan
>
All of these "Proposal for heterogeneous runtime dispatch" proposals sound to me like layer upon layer of lipstick applied to a pig, with far-reaching consequences for OSes and working software for decades to come if approved. As if the entire world should spin around trying to allow processes to migrate with minimum cost between dissimilar cores in an Alder Lake het. core CPU. None of the proposed solutions is cheap. Their impact on efficiency in the real world is almost certainly negative (in the "entire world" context). Let's not forget that only a minuscule fraction of the CPUs ever produced, or likely to be produced in the medium term, could benefit from such shenanigans. The cost in software support and hardware cycles on all x86-64 systems (including Alder Lake CPUs) comes on top of the crazy waste in software development/support effort.
So, give it a rest, please. Let's agree to let Intel forget, and let us forget, about this egregious mistake without undue penalty. Hopefully they've learned their lesson; to wit, they disabled AVX-512 in BIOS updates, didn't they?
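
For anyone trying to picture what is being argued over, here is a rough sketch of step 3 of Jan's quoted proposal combined with Brendan's nesting counter. Everything in it is an assumption made for illustration: the per-CPU feature table is hard-coded rather than filled from CPUID, the caller passes the logical processor index instead of querying the OS, the "AVX" kernels are scalar stand-ins, and `migration_disable_count` is a plain thread-local counter that does not actually prevent migration (no such user-space OS interface is assumed to exist). `__builtin_ctz` is the GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>

/* Implementation indices double as feature-bit positions. The best
   implementation gets the lowest bit so that ctz() selects it. */
enum { IMPL_AVX512 = 0, IMPL_AVX2 = 1, IMPL_SCALAR = 2, NUM_IMPLS = 3 };
enum { NUM_CPUS = 2 };

typedef long (*sum_fn)(const int *v, size_t n);

static long sum_scalar(const int *v, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) s += v[i];
    return s;
}

/* Stand-ins: a real library would put SIMD kernels here. */
static long sum_avx2(const int *v, size_t n)   { return sum_scalar(v, n); }
static long sum_avx512(const int *v, size_t n) { return sum_scalar(v, n); }

/* The "pre-baked table" of function pointers. */
static const sum_fn impl_table[NUM_IMPLS] = {
    [IMPL_AVX512] = sum_avx512,
    [IMPL_AVX2]   = sum_avx2,
    [IMPL_SCALAR] = sum_scalar,
};

/* Step 1 (simulated): one "which features can I use" bitfield per logical
   processor. Hard-coded here: CPU 0 has everything, CPU 1 lacks AVX-512. */
static const unsigned cpu_features[NUM_CPUS] = { 0x7u, 0x6u };

/* Step 2 (simulated): hypothetical nesting counter in thread-local storage.
   It only counts; nothing here pins the thread to a core. */
_Thread_local unsigned migration_disable_count;

/* Step 3: look up the bitfield for this CPU and take its lowest set bit. */
int select_impl(unsigned cpu) {
    assert(cpu < NUM_CPUS);
    unsigned features = cpu_features[cpu];
    assert(features != 0);            /* ctz(0) is undefined */
    return __builtin_ctz(features);
}

long sum_dispatch(unsigned cpu, const int *v, size_t n) {
    migration_disable_count++;        /* "enter library" */
    long result = impl_table[select_impl(cpu)](v, n);
    migration_disable_count--;        /* "leave library" */
    return result;
}
```

With this table, `select_impl(0)` yields `IMPL_AVX512` and `select_impl(1)` yields `IMPL_AVX2`. Note that even in this toy form every library entry pays for a counter update, a table lookup and an indirect call, which is the per-call cost the reply above objects to.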