By: Doug S (foo.delete@this.bar.bar), May 18, 2021 8:45 am
Room: Moderated Discussions
Foo_ (foo.delete@this.nomail.com) on May 18, 2021 1:58 am wrote:
> Little Horn (sink.delete@this.example.net) on May 17, 2021 5:03 pm wrote:
> > Thoughts?
>
> It looks like they are hand-waving a lot of the difficulties.
>
> - storing lots of thread state in registers in the CPU is supposed to be cheap (they
> also limit their size estimate to SSE3, conveniently ignoring AVX and AVX512)
If you had some hardware thread types that didn't support SIMD or FP instructions at all, you could minimize this difficulty for things like hardware interrupt threads. Does doing I/O to a USB controller or a network controller to set up a DMA transfer need those instructions? I shouldn't think so.
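To put rough numbers on the quoted point about SSE3 vs. AVX-512, here's a back-of-envelope of the x86-64 architectural register state per thread. Only the register files themselves are counted (no MSRs or other internal state), and the hardware-thread counts at the end are made up for illustration:

```python
# Back-of-envelope per-thread architectural register state on x86-64.
GPR = 16 * 8    # 16 general-purpose registers x 8 bytes
XMM = 16 * 16   # SSE/SSE3: 16 XMM registers x 16 bytes
ZMM = 32 * 64   # AVX-512: 32 ZMM registers x 64 bytes (each supersets an XMM/YMM)
K   = 8 * 8     # AVX-512 opmask registers

sse_only = GPR + XMM       # roughly what an SSE3-only estimate covers
avx512   = GPR + ZMM + K   # what a core with AVX-512 actually has to hold

print(sse_only, avx512)    # 384 vs 2240 bytes per thread
for n in (64, 256, 1024):  # hypothetical hardware-thread counts
    print(n, "threads:", n * avx512 // 1024, "KiB of register state")
```

So the AVX-512 state is nearly 6x the SSE-only figure, and at hundreds or thousands of resident hardware threads you're into megabyte territory of register file, which is exactly the "scales like cache, not logic" problem discussed below.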
> - software scheduling still needs to happen in the not unlikely event that the total number of threads
> managed by software is larger than the number of HW threads (not unlikely, because they are advocating
> for the use of the blocking IO model, which requires one thread per on-going request)
Yes, 100%! You add all that extra hardware complexity, but you STILL have to support context switching unless you want the number of hardware threads to cap the number of software threads. Maybe that's feasible in a smartphone or an "average case" PC, but for power users, and certainly for servers, there's no way you could include enough hardware threads to put a hard limit on the number of software threads.
> - cache and TLB flushes will still happen... but, according to the authors, it's not a problem because it's
> not worse. Yet avoiding the *cost* of context switches was the entire motivation for their proposal
This is the biggest problem. Cache and TLB flushes account for much of the cost of a context switch; it isn't all register saving. Basically it sounds like they are saying "we'll avoid the cost of saving registers and other thread state up to a certain number of threads, but everything else stays the same, and oh yeah, the amount of silicon used for registers will increase by a couple of orders of magnitude".
It is better, but is that really the best use of all those extra transistors? Especially since scaling for cache (I assume register files would scale more like cache than like logic) is really poor these days: according to TSMC, N3 improves logic density by 70% over N5 but cache density by only 20%. At least on the leading edge, it would appear that a solution requiring more logic is superior to one requiring more cache/registers.
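A quick illustration of why that scaling gap matters. The 1.7x and 1.2x density factors are the TSMC N5-to-N3 figures cited above; the 10 mm^2 starting block is hypothetical:

```python
# Same N5 area, shrunk to N3, depending on what it's made of.
logic_scale = 1.7  # ~70% logic density improvement, N5 -> N3
sram_scale  = 1.2  # ~20% cache/SRAM density improvement

area_n5 = 10.0  # mm^2 of some hypothetical N5 block

print(round(area_n5 / logic_scale, 1))  # ~5.9 mm^2 on N3 if it's logic
print(round(area_n5 / sram_scale, 1))   # ~8.3 mm^2 on N3 if it's registers/cache
```

In other words, the same block costs roughly 40% more area on N3 if you build it out of register/SRAM cells instead of logic, and that gap is expected to keep widening on future nodes.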
As with my suggestion above of hardware thread types that don't support SIMD or FP, I suppose a control register that let you disable those instruction types for certain software threads would let you avoid the expensive saves of those registers on some context switches. I haven't really thought this through much, so it may have some obvious flaws, but it would be a much easier way to reduce the burden of saving thread state than what the authors of this paper are recommending.
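For what it's worth, this is in the same spirit as the lazy FPU switching x86 kernels used to do with the CR0.TS bit: don't save/restore vector state for threads that never touch it. A toy sketch of the switch-cost logic (the byte counts and the per-thread flag are illustrative, not from the paper):

```python
# Toy model: skip spilling vector state for threads whose SIMD/FP-disable
# bit is set. In hardware this flag would live in a control register.
GPR_BYTES = 128   # integer register state (16 x 8 bytes)
VEC_BYTES = 2048  # vector register state (e.g. 32 x 64-byte ZMM)

class Thread:
    def __init__(self, uses_simd):
        self.uses_simd = uses_simd  # hypothetical "SIMD/FP enabled" bit

def switch_out_cost(thread):
    # Integer state is always spilled; vector state only if the outgoing
    # thread was allowed to use SIMD/FP at all.
    cost = GPR_BYTES
    if thread.uses_simd:
        cost += VEC_BYTES
    return cost

irq_handler = Thread(uses_simd=False)  # e.g. a USB/NIC interrupt thread
worker      = Thread(uses_simd=True)   # ordinary application thread

print(switch_out_cost(irq_handler))  # 128 bytes
print(switch_out_cost(worker))       # 2176 bytes
```

For integer-only threads that's a ~17x reduction in state to save, with no extra register file at all, which is why it seems like a cheaper lever than replicating the whole register set per hardware thread.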