By: Brendan (btrotter.delete@this.gmail.com), May 23, 2022 1:51 am
Room: Moderated Discussions
Hi,
Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on May 22, 2022 11:11 pm wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on May 22, 2022 9:49 pm wrote:
> > And exactly like Andrey tried to explain to you, the actual library function that gets used tends
> > to be picked either at library load time or at first use, and then it is fixed for the lifetime of
> > the whole process (and fixed across threads). That isn't the only way to do it, no, but it's by far
> > the common one, and it's one of the major uses of the cpuid instruction in modern programming.
> I agree this is how it's currently done (including by us). But given that we are moving into a world with more
> heterogeneity (whether we like it or not), isn't that an occasion to perhaps adapt what we've been doing?
>
> Here is a proposal for lightweight but adaptable dispatch that seems reasonable, do you see any major flaws?
> 1) At startup the library runs existing (expensive) CPUID-based checks once per (enabled)
> logical processor, and stores an array of "which features can I use" bitfields.
> 2) The OS provides a lightweight "don't migrate me" flag, perhaps
> even mapped into user space to avoid kernel entry.
> 3) On entering the library, set the no_migrate flag, get the current logical processor index,
> use it to look up our bitfield, ctz() on that to get the index of the function pointer to
> call from our pre-baked table. After we're done with SIMD, reset the no_migrate.
Minor alterations:
a) I'd still prefer a "migration disable counter" for nesting: increment it when entering the library and decrement it when leaving, so that things still work if the caller already disabled migration before calling the library. This could be in thread local storage.
b) The kernel can maintain a "current CPU type" (set when the thread is created, updated when the thread is migrated to a different CPU type) in the thread's thread local storage (in the same cache line as the thread's "migration disable counter"), so that you don't need checks "once per (enabled) logical processor" ("once per CPU type" is enough).
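To make (a) and (b) concrete, here's a rough sketch in C of what the library side might look like. Everything in it is an assumption for illustration: no current OS provides a kernel-maintained thread local block like this, so `struct thread_sched_info`, `tsi`, `features_by_type`, `pick_variant()` and `lib_sum()` are hypothetical names, the "kernel" fields are only simulated with ordinary thread local storage, and the "wide" function is plain C standing in for a real SIMD variant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL: a per-thread block the kernel would keep up to date.
   Both fields are small enough to share one cache line, as in (b). */
struct thread_sched_info {
    volatile uint32_t migration_disable_count; /* scheduler may migrate only when 0 */
    volatile uint32_t cpu_type;                /* set at thread creation, updated on migration */
};
static _Thread_local struct thread_sched_info tsi; /* simulated here with plain TLS */

/* Feature bitfields, filled once per CPU type (not once per logical processor).
   Bit 0 = "wide" variant usable, bit 1 = baseline (always set).
   The values are hard-coded stand-ins for the results of CPUID-based checks. */
enum { MAX_CPU_TYPES = 2 };
static uint32_t features_by_type[MAX_CPU_TYPES] = { 0x3u, 0x2u };

typedef int (*sum_fn)(const int *, size_t);

/* Stand-in for a SIMD implementation: four independent accumulators. */
static int sum_wide(const int *v, size_t n) {
    int a = 0, b = 0, c = 0, d = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) { a += v[i]; b += v[i + 1]; c += v[i + 2]; d += v[i + 3]; }
    for (; i < n; i++) a += v[i];
    return a + b + c + d;
}

static int sum_baseline(const int *v, size_t n) {
    int s = 0;
    for (size_t i = 0; i < n; i++) s += v[i];
    return s;
}

/* Pre-baked table, ordered so that the lowest set feature bit is the best choice. */
static const sum_fn sum_table[] = { sum_wide, sum_baseline };

/* Portable count-trailing-zeros; x must be non-zero (the baseline bit guarantees it). */
static unsigned ctz32(uint32_t x) {
    unsigned n = 0;
    while ((x & 1u) == 0) { x >>= 1; n++; }
    return n;
}

/* Index into sum_table for the CPU type this thread is currently on. */
static unsigned pick_variant(void) {
    return ctz32(features_by_type[tsi.cpu_type]);
}

/* Library entry point: disable migration (nestable), dispatch, re-enable. */
int lib_sum(const int *v, size_t n) {
    tsi.migration_disable_count++;   /* fine even if the caller already disabled migration */
    int result = sum_table[pick_variant()](v, n);
    assert(tsi.migration_disable_count > 0);
    tsi.migration_disable_count--;   /* migration is allowed again only when this reaches 0 */
    return result;
}
```

Because it's a counter and not a flag, a caller that has already disabled migration gets the same count back after `lib_sum()` returns. A real version would also need the kernel to honour the counter and would have to think about ordering between the increment and the read of `cpu_type`; `volatile` here is only a placeholder for that.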
For a more invasive approach, the OS could "pre-initialize" libraries (whenever they're updated or CPUs are replaced) and store the results as "cached pre-initialized library files". It would then give each process 2 virtual address spaces that are almost identical except for which version of the pre-initialized library is used, and the scheduler would change a thread's virtual address space whenever it migrates the thread to a different CPU type. This would avoid all of the startup overhead and most of the dispatch overhead (even in the "homogeneous CPUs" case, where you'd only have 1 virtual address space per process).
- Brendan