A Case Against That Other Paper

By: Mark Roulo (nothanks.delete@this.xxx.com), May 19, 2021 1:09 pm
Room: Moderated Discussions
Brendan (btrotter.delete@this.gmail.com) on May 18, 2021 11:05 pm wrote:
> Mark Roulo (nothanks.delete@this.xxx.com) on May 18, 2021 3:32 pm wrote:
> > Brendan (btrotter.delete@this.gmail.com) on May 18, 2021 12:37 pm wrote:
> > > A Case Against That Other Paper
> > >
> > > By Brendan and a Rubber Duck
> > >
> > >
> > > Introduction
> > >
> > > CPUs with lots of hardware threads suffer severe problems for all forms of caching (including branch
> > > prediction and TLBs) while also destroying any hope of effective hardware prefetching (for instruction,
> > > data and TLB) as there's no hope of any kind of locality between disparate threads; causing cache
> > > thrashing, and exacerbating the "1000+ channel memory controller doesn't exist" problem. With significantly
> > > fewer hardware threads (no more than 4) we believe a CPU can actually work properly, leading to
> > > an order of magnitude better performance for embarrassingly parallel workloads and 2 or more orders
> > > of magnitude better performance for anything subject to Amdahl's law.
> > >
> > > Currently, all of the CPUs that provide many hardware threads
> > > get fundamental parts of task switching wrong.
> >
> > The paper proposes that the SOFTWARE control the task switching,
> > which is quite different from today's SMT/Hyperthreading.
> Are you sure? To me it looks like a huge number of hardware threads (for IRQ handlers, syscall
> and exception handlers, VM-exit handlers, user-space tasks waiting for IO, all inter-process communication,
> ...) are supposed to be using "monitor/mwait" to block/unblock without any kernel involvement;
> and almost all of their claims ("no more interrupts", "Fast I/O without Inefficient Polling", "nanosecond
> scale task switches", .. ) are based on this fast blocking/unblocking.

No, I'm not sure :-)

The paper implies that an OS call could be mapped to a user-mode thread unblocking/starting a kernel-mode thread (with no scheduling required), which converts an OS trap into a fast function call into the kernel (fast because the thread on the other side already exists and the scheduler doesn't need to run and ...).
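A rough software analogy of that "call by unblocking an already-existing thread" idea, as a minimal sketch only. The names `ParkedHandler` and `sys_add` are hypothetical, and the blocking `queue.Queue.get()` merely stands in for the hardware block/unblock the paper has in mind. Unlike the proposal, this version still goes through the OS scheduler, so it shows the structure and not the performance.

```python
import queue
import threading


class ParkedHandler:
    """A pre-created handler thread that stays blocked until a request arrives."""

    def __init__(self, handler):
        self._handler = handler
        self._requests = queue.Queue()
        # The thread is created once, up front, and then parks itself in _run().
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            # Blocks until a caller posts a request (stand-in for monitor/mwait).
            args, reply = self._requests.get()
            reply.put(self._handler(*args))

    def call(self, *args):
        # The caller wakes the parked thread and waits for its answer.
        reply = queue.Queue(maxsize=1)
        self._requests.put((args, reply))
        return reply.get()


def sys_add(a, b):
    # Hypothetical "kernel service" used only for illustration.
    return a + b


handler = ParkedHandler(sys_add)
print(handler.call(2, 3))
```

The point of the structure is that the caller never creates a thread or context per call; it only wakes one that is already waiting.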

I believe that the author wants fast thread switching rather than 1000-wide thread multiplexing.

Storage for Thread State: Our proposal relies on hardware to store state for a large number of threads so that start and stop are fast (nanosecond scale). The state includes all general-purpose and control registers. For x86-64, a thread has 272 bytes of register state that goes up to 784 bytes if SSE3 vector extensions are used. A first option is to implement hardware threads as SMT hyperthreads in modern CPUs. Hyperthreads execute concurrently by sharing pipeline resources in a fine-grain manner. Their implementation is expensive as all pipeline buffers must be tagged, partitioned, or replicated. This is why most popular CPUs support up to 2 hyperthreads and few designs have gone up to 4 or 8 hyperthreads. We believe that the two concerns should be separated: use a small number of hyperthreads to best utilize the complex pipeline (likely 2-4) and multiplex additional runnable hardware threads on the available hyperthreads in hardware. The state for additional hardware threads can be stored in large register files, similar to early multithreaded CPUs for coarse-grain thread interleaving. The overhead of starting execution of a thread stored in these register files would be proportional to the length of the pipeline, roughly 20 clock cycles in modern processors.
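To put those numbers in perspective, here is a back-of-the-envelope calculation. The 272 and 784 byte figures and the roughly 20 cycle start cost come from the quoted passage; the thread counts and the 3 GHz clock are illustrative assumptions, not values from the paper.

```python
# Per-thread register state and start cost, as quoted from the paper.
BASE_STATE_BYTES = 272      # general-purpose and control registers
VECTOR_STATE_BYTES = 784    # with the vector extensions included
RESTART_CYCLES = 20         # roughly one pipeline length

# Assumptions for illustration only (not from the paper).
ASSUMED_CLOCK_GHZ = 3.0
ASSUMED_THREAD_COUNTS = (64, 256, 1024)


def state_storage_bytes(num_threads, bytes_per_thread):
    """Total register-file storage needed to keep num_threads resident."""
    return num_threads * bytes_per_thread


def restart_latency_ns(cycles, clock_ghz):
    """Convert a cycle count to nanoseconds (cycles / GHz = ns)."""
    return cycles / clock_ghz


for n in ASSUMED_THREAD_COUNTS:
    base_kib = state_storage_bytes(n, BASE_STATE_BYTES) / 1024
    vec_kib = state_storage_bytes(n, VECTOR_STATE_BYTES) / 1024
    print(f"{n:5d} threads: {base_kib:7.1f} KiB base, {vec_kib:7.1f} KiB with vectors")

print(f"start cost: {restart_latency_ns(RESTART_CYCLES, ASSUMED_CLOCK_GHZ):.1f} ns")
```

With 1024 resident threads and full vector state, that works out to 784 KiB of thread storage, and 20 cycles at an assumed 3 GHz is a bit under 7 ns, which fits the "nanosecond scale" claim.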

The paper talks about micro-kernels (and I think a lot of the intended benefit is for micro-kernels and hypervisors ...)

Fundamentally, these optimizations are based on the insight that it can be faster and simpler to start and stop a large number of hardware threads rather than to frequently multiplex a large number of software contexts on top of a single hardware thread. In addition to requiring mode changes and state swapping, the latter often invokes the general OS scheduler with unpredictable performance results. The cost of starting and stopping the execution of hardware threads can be kept low, and its impact on overall performance can be predictable.
