A Case Against That Other Paper

By: Brendan (btrotter.delete@this.gmail.com), May 18, 2021 11:05 pm
Room: Moderated Discussions
Mark Roulo (nothanks.delete@this.xxx.com) on May 18, 2021 3:32 pm wrote:
> Brendan (btrotter.delete@this.gmail.com) on May 18, 2021 12:37 pm wrote:
> > A Case Against That Other Paper
> >
> > By Brendan and a Rubber Duck
> >
> >
> > Introduction
> >
> > CPUs with lots of hardware threads suffer severe problems for all forms of caching (including branch
> > prediction and TLBs) while also destroying any hope of effective hardware prefetching (for instruction,
> > data and TLB) as there's no hope of any kind of locality between disparate threads; causing cache
> > thrashing, and exacerbating the "1000+ channel memory controller doesn't exist" problem. With significantly
> > fewer hardware threads (no more than 4) we believe a CPU can actually work properly, leading to
> > an order of magnitude better performance for embarrassingly parallel workloads and 2 or more orders
> > of magnitude better performance for anything subject to Amdahl's law.
> >
> > Currently, all of the CPUs that provide many hardware threads
> > get fundamental parts of task switching wrong.
> The paper proposes that the SOFTWARE control the task switching,
> which is quite different from today's SMT/Hyperthreading.

Are you sure? To me it looks like a huge number of hardware threads (for IRQ handlers, syscall and exception handlers, VM-exit handlers, user-space tasks waiting for IO, all inter-process communication, ...) are supposed to use "monitor/mwait" to block/unblock without any kernel involvement; and almost all of their claims ("no more interrupts", "Fast I/O without Inefficient Polling", "nanosecond scale task switches", ...) are based on this fast blocking/unblocking.

> The idea is that task switching will be faster if each task stores its register state in H/W
> rather than in memory (cache or DRAM). The OS, presumably, would still be responsible for selecting
> the thread(s) to run at any one time, but the cost of a task switch would be low.

Ignoring SIMD state, a low level software task switch mostly only has to save/restore the "callee preserved" registers (for the System V AMD64 calling convention, that's RBX, RSP, RBP, and R12–R15). Sure, there are a couple of other details (loading but not saving CR3 if there's a virtual address space change, setting the "RSP0" field, the 64-bit counterpart of "ESP0", in the TSS for some/most kernel designs); and saving/restoring SIMD state isn't so cheap (but is avoidable in a lot of cases if you care - e.g. a "don't save unless actually used, zero instead of loading, postpone loading until actual use" scheme); but the cost of the low level task switch itself still falls into "LOL, who cares" territory.
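To make the "don't save unless actually used, zero instead of loading, postpone loading until actual use" scheme concrete, here's a minimal sketch in C. It's a user-space model, not kernel code: the names (struct cpu, switch_to, simd_first_use), the 512-byte state size and the simd_trap flag (standing in for something like a "SIMD use faults" bit) are all assumptions made for illustration, and the callee-saved register struct is only there to show how little integer state is involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIMD_BYTES 512 /* illustrative size of one task's SIMD state */

/* Integer registers a System V AMD64 callee must preserve (7 x 8 bytes). */
struct callee_saved {
    uint64_t rbx, rsp, rbp, r12, r13, r14, r15;
};

/* Hypothetical per-task state; regs is shown for scale, not exercised here. */
struct task {
    struct callee_saved regs;
    bool simd_used;                       /* has this task ever touched SIMD? */
    unsigned char simd_save[SIMD_BYTES];  /* save area, valid once simd_used */
};

/* Hypothetical model of one CPU. */
struct cpu {
    unsigned char simd_regs[SIMD_BYTES];  /* models the physical SIMD registers */
    struct task *current;                 /* task currently running */
    struct task *simd_owner;              /* task whose state is live in simd_regs */
    bool simd_trap;                       /* models "next SIMD use will fault" */
    unsigned saves, loads;                /* counters, for demonstration only */
};

/* Task switch: touches no SIMD state, only arms the trap if needed. */
void switch_to(struct cpu *c, struct task *next)
{
    c->current = next;
    c->simd_trap = (c->simd_owner != next);
}

/* Models the fault handler that runs on the first SIMD use after a switch. */
void simd_first_use(struct cpu *c)
{
    assert(c->current != NULL);
    if (!c->simd_trap)
        return; /* state in the registers already belongs to this task */

    struct task *t = c->current;

    /* Save only now, when another task really needs the registers. */
    if (c->simd_owner != NULL) {
        memcpy(c->simd_owner->simd_save, c->simd_regs, SIMD_BYTES);
        c->saves++;
    }

    if (t->simd_used) {
        /* Postponed load: restore this task's earlier state. */
        memcpy(c->simd_regs, t->simd_save, SIMD_BYTES);
        c->loads++;
    } else {
        /* First ever use: zero instead of loading. */
        memset(c->simd_regs, 0, SIMD_BYTES);
        t->simd_used = true;
    }

    c->simd_owner = t;
    c->simd_trap = false;
}
```

In this model, switching from a SIMD-using task to one that never touches SIMD and back again costs no saves and no loads at all; the copies only happen when a second task actually uses SIMD. A real kernel would have more to worry about (multiple CPUs, and information leaking through stale register contents), which this sketch ignores.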

It's everything else (managing blocked/unblocked states, deciding which task to give CPU time, tracking statistics, the cost of "change in working set" on caches, etc) that causes almost all of the overhead; so improving the almost irrelevant part (low level task switch) while keeping everything responsible for overhead the same wouldn't make much sense.

> I'm envisioning something like SPARCs register windows, just used for a different
> purpose and MUCH larger. Or like the Z80 banked registers, just much more so.

I'm imagining "barrel processor with hyper-threading" (e.g. switching between pairs of hardware threads every few cycles); but that's only one part of it (when tasks are actively running and their state is kept in the CPU's register file).

For when tasks are not actively running (blocked, "mwaiting"), I'm also imagining their state being kept in a special private "cache like" area in the CPU, where blocking/unblocking causes task state to be transferred between register file and "special private cache-like storage".

Note that the paper doesn't actually say - the "Storage for Thread State" section on page 21 describes 2 possible options ("huge register file only" as one option, and "private (per-core) L2 caches or the shared L3 caches, to also store state for the additional hardware threads" as the other option). I merely assumed that their first option ("huge register file only") won't work in practice (that the register file in an out-of-order CPU, where execution units are competitively shared by threads, won't scale as well as it can for a GPU, where execution units are "per hardware thread" and not shared).

- Brendan