A LONG response to the paper

By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), May 22, 2021 6:15 am
Room: Moderated Discussions
I have tried to edit my previously written partial critique of the paper into a collection of responses to quotes from the paper. The organization is somewhat lacking; after I noticed cases of repeated content, I chose to prioritize posting latency over clarity and organization. (I.e., I gave up on writing a good post after having delayed for days.) The prevalance of tangential comments is significant (if less than many of my posts), and the length is absurd. Even so, a skimming of this post might be worthwhile to some.

"Logically, the availability of multiple threads per core is no different from the availability of multiple cores per CPU."

This statement seems under-informed. POWER7 provided software-specified, hardware-enforced priority control.

Also, MIPS MultiThreading Application Specific Extension (2006) Thread Context distinct from Virtual Processing Element; when taking an exception a VPE becomes single-threaded.

For an architecture without privileged context (like the Mill or My 66000), a minimal 'hardware thread' seems difficult to distinguish from a 'virtual processor'.

"software should have direct control over hardware threads. It should be able to start and stop them, i.e., determine which ones can be scheduled for execution on the pipeline, and modify registers belonging to different hardware threads."

Here I would argue that more hardware management might be appropriate. Software is less equipped to monitor finer-grained resource utilization. (Most architectures seem to limit microarchitectural resource utilization monitoring by software to performance counters. Detecting when priorities

"isolation necessary for hypervisors and sandboxes will be offered at low cost"

As mentioned later in this post, permission crossing is more expensive than necessary. Similarly interprocessor interrupts (and, more generally, remote procedure calls) are more expensive than necessary.

"invokes the general OS scheduler with unpredictable performance results"

There is no fundamental reason that tasks desiring specialized scheduling would have to use a general OS scheduler. The OS could even optimize for the critical case over the common case. With plentiful resources (execution, memory bandwidth, power, et al.) the OS might perform some work in the background to reduce the delay for critical requests.

"the kernel can designate a hardware thread per core per interrupt type. Each of these threads initially blocks on a memory address using an mwait instruction"
"The hardware thread becomes runnable and handles the event without the need to jump into an IRQ context and the associated overheads."

One does not need to have low latency full-context thread resumption to provide fast, low-overhead interrupts. One might increase the number of interrupt types rather than increase the handlers by linking to a mwait address. The distinction between a tagged interrupt table and a table for monitors does not seem especially great conceptually.

I very much like mwait (or really event-aware thread scheduling) and like the conceptual unification a similar interface might provide. However, the benefit of a clean novel interface over a kludgy extension of legacy interfaces may not be especially great and is mortaged over a longer period (the individual payments are lower even if the total cost is higher).

"the state management necessary when switching privilege levels within a hardware thread can take hundreds of cycles"

The cost for changing permissions is unnecessarily high. For a system call, there is no need for save/restore (not that such needs to be particularly expensive, especially as it can be done somewhat lazily by hardware if a context address is provided by the architecture) and the managing of a different permission set need not be fundamentally different than changing an FP rounding mode (which is often high latency but not nearly as much as a system call). Legacy architectures' use of privileged internal context makes such more complex, but even that would not seem especially problematic as most privileged actions relate to memory accesses, a few common cases might be fast-pathed, and more involved system calls would presumably both reduce the fraction of overhead and allow some delay before slow-path internal context changes are first used (once the permission change is non-speculative, behavior would be the same as a typical implementation).

Reducing this overhead does not require providing an abundance of thread contexts that are paused and ready to run with low latency.

Intel's Meltdown problem does hint that retrofitting fast system calls may excessively increase design complexity and introduce security holes.

"Kernels can then use FP and vector operations and link with libraries that use them without affecting the system call invocation latency."

Good support for lazy/opportunistic state saving and restoration would seem to reduce the importance of this issue. Saving and restoring the entire FP/vector context (for architectures like x86 with AVX-512) seems overkill for most system calls.

"There is no need to move into kernel space and invoke the scheduler." [for microkernel service use]

As with interrupt handling (and interprocessor interrupts/remote procedure calls), launching into a specific permission domain (and core) does not require persisting full thread contexts. Much of a service function's register context would be parameters, some might be derived from a template (which might be viewed simply as thread-type aware compression, replacing repeated values with a pointer to the value for its thread-type), and some might be somewhat arbitrarily generated (perhaps the handler's stack pointer).

(The distinction between a compressed register context for a thread and a handler template might not be great, but I received the impression that compression was not considered, especially not based on 'thread-type' and the interface for parameter passing is single register at a time "push". Generation of thread state seems not to map well to thread context caching, at least not as neatly as to the template concept — not that such is necessarily a practical idea.)

"Quick hand-offs between hardware threads allow isolation without loss of performance."

Again, the slowness of permission change is not fundamental. The overhead need not be that much greater than the overhead of a function call. For system calls, a change in microarchitecture would suffice. A more flexible interface (like My 66000's Port Holes) would be approximately the same performance. Most such transfers also do not use much register state that persists across calls.

"Event-based models are more difficult to work with than threading models due to their confusing control flow"

Although I had glanced at the referenced paper ("Why events are a bad idea (for high-concurrency servers)", Rob von Behren et al., 2003) before, I do not recall being fully convinced by its argument.

If threads are bound to specific cores, software's choice of which thread to request to handle a task seems more complex. E.g., choosing a remote core's thread because it has better data/code cache locality or is more likely to have underutilized execution resources could backfire if the core is in deep sleep.

"Proposed instructions" "monitor ", "mwait", "start ", "stop ", "rpull , , ", "rpush , , ", "rpull , , ", "invtid , "

Mitch Alsup's My 66000 has a cleaner interface; ordinary loads and stores are used to read and write a thread's registers. (I think I might prefer a seperate address space for such accesses to indicate to hardware the coherence/synchronization requirements.) Since many of the accesses to other threads are more like function calls with multiple arguments, I think a less RISC-like interface may be desirable.

When calling a service, one would often not care about the physical thread ID. The desired semantic seems to be more spawn thread with some parameters and specific base state of permissions and other context. If a virtual thread ID is bound to a specific physical thread ID and that to a specific core, unnecessary core/cluster hopping and excessive waiting for execution resources would be more likely. Hardware (ideally, I think, with software-provided information) would be in a better position to decide when a service should be executed nearby (communication overhead, latency, energy-efficiency, etc.) and how much service-specific caching (code, data, and even metadata such as branch prediction and prefetch behavior) should be weighed in the decision.

(Directly providing "parameters" for an I/O event handler, just as if it were a remote procedure call — treating I/O devices more similarly to processing agents, which in this model could use RPCs for service requests or work transfer/distribution, somewhat as I/O MMUs extended the similarity. In some cases that could avoid I/O device round-trip latency. Some have also suggested moving I/O handling threads nearer the devices to reduce latency; such threads could also benefit from a different microarchitecture since ILP, caching behavior, and other factors differ from more compute-oriented threads. The total context for an I/O event handler might be smaller and/or only a portion of the available context is initialized to non-zero values.)

(Direct insertion of data might also be applied to mwait. This would not provide much benefit — only avoiding an explicit loading of the value and if the idiom is common the microarchitecture could prefetch the value, perhaps into the store queue, to provide similar latency to loading into a register. Some uses of mwait would only wait for an event and not have an obvious result value; the unnecessary killing of a register value might not be a horrible inefficiency but it might be better to avoid such.)

The cost of instantiating a thread from a template does not seem to need to be extremely expensive, particularly for short-lived threads. Hardware could support copy-on-write and could defer page allocation (similar to the Mill's backless memory; this would seem particularly helpful for short-lived threads where physical memory allocation might be avoided entirely).

Data partitioning and synchronization would seem to require some thought. One would want to get the benefits of single-writer (or better single-user) at-a-time memory locations as well as data locality without making thread migration unnecessarily expensive (to exploit resource availability on other cores).

"Triggering an exception in a thread without a handler for that exception type indicates a serious kernel bug akin to a triple-fault, and can be handled by halting or resetting the CPU."

I am inclined to think that a localized resource constraint (comparable to stack size limiting) might be more consistent with fault isolating design. However, I have not thought about such, and even with my limited consideration (and limited familiarity with system software design) I could imagine resource-oriented limits substantially complicating interactions among components.

"Thread Descriptor Table ... may allow a user ptid to start, stop, and modify other less privileged registers in other ptids"

Aside from adopting the questionable privileged/unprivileged division, flexible permissions would seem to require a TDT for every "process". I think something more like My 66000's Port Holes (permission-changing function calls) and memory address permissions might be better.

"An alternative to the TDT could be a secret-key-based design."

I suspect people would be wary of key-based security. While even with ECC-protected memory there is some probability of a permission violation, the comfort-level with and/or understanding of such hardware-fault risk seems greater.

"Thus, with regard to Spectre, our proposal is no different from software-thread context-switching today: We still just multiplex on top of SMT threads."

Side channels will exist with any resource sharing; the finer the temporal granularity of sharing, the more bandwidth is likely to be available. The paper's proposal is somewhat different from software-thread context switching in that the frequency of such switching will be higher and resource sharing finer-grained (e.g., the proposal seems to assume fine-grained scheduling with intermixing). Even with only neutral or friendly threads sharing the pipeline at a given time, side channels would exist in the shared L1 caches (and branch predictors). Speculation-based side channels might be blocked from communicating to enemy threads (so secure software that considers such factors would still be secure), but I received the impression that some of the side channels of concern were more about L1 cache sharing than speculation-based side channels for software (like web browsers) that was not considered security-critical.

With one millisecond context switch times, side channel bandwidth would be lower than with 16 nanosecond switch times. An increase in the number of independent threads active could add noise to some side channels, so the effects might not be entirely negative. Threads that do not share memory could also use diverse mapping functions (using a per-thread constant to XOR with the address or scramble bits might not greatly increase latency) and "encryption".

"The state for additional hardware threads can be stored in large register files, similar to early multithreaded CPUs for coarse-grain thread interleaving [17–19]."

Ordinary unified register files containing tens of full thread contexts would be relatively expensive in area (as well as static power and access energy and latency — the latency might not be especially problematic but would seem to add scheduling and value storage/routing complexity). Using 3D register file techniques (Marc Tremblay, Bill Joy and Ken Shin, "A Three Dimensional Register File For Superscalar Processors", 1995) would reduce the area overhead by exploiting limited wire density in highly ported register files to hide additional memory cells, this is not sufficient for 10s of threads. (This technique could be considered temporal banking and was used by Intel for Itanium's Switch-on-Event MT.) This storage area optimization would still have significant static power costs, though it might be possible for currently inaccessible "banks" to have lower power (without excessively increasing area or access latency) and zero-value cells might be optimized for low leakage.

With large window out-of-order SMT, much of the register state is temporary and shared to enhance utilization. For in-order MT designs, techniques like Scout Threading may be used to increase resource utilization. Having an area- and power-consuming resource be mostly idle should trigger concern — not that utilization should be the optimized metric but utilization correlates somewhat with performance and cost.

"Modern GPUs have demonstrated the practicality of implementing multi-ported register files that store 10s of KBytes and support 10s of hardware threads."

I was under the impression that most GPUs exploited wide SIMD to read four times the amount of data used in a cycle of execution so that a single-ported register file could be used. Even with that optimization, significant research is published on reducing register file accesses and register file size (e.g., avoiding register file writeback for values that can be "forwarded" from recent operations). GPUs are also otherwise optimized for throughput over latency.

This weak understanding of register files seems to indicate lack of consulation with people more familiar with physical implementations. I am not a hardware person, but I am aware that smaller is faster and lower power/energy (both significant considerations for latency-optimized processors) and that multiporting increases area (such that 3D register files and replicated register files make sense and GPUs often use single-ported register files). If I were seriously proposing something like what the paper proposes (as opposed to posting such in a journal letter or less formal WACI-like talk — or just a web/USENET forum post), I would want to consult a real hardware person about the hardware considerations. On the other hand, this might be a case of unknown unknowns.

"selecting which threads are stored closer to the core based on criticality, tracking used/modified registers to avoid redundant transfers, and hardware prefetching of the state of recently woken up threads closer to the processor core."

This statement shows that some thought was given to storage location. Even with conceived rigidity of hardware thread count and localized context storage, it would be possible to for unused registers to provide additional cache capacity and not just reduced bandwidth utilization. Providing variable context sizes (or at least encouraging threads to utilize certain groups of registers and leave others zeroed so that hardware could trivial compress state) seems desirable but was not mentioned. For higher latency context switches, compression would seem reasonable (with spare storage capacity provided to cache or metadata storage).

Criticality might also apply to the values within a thread's register context. Data addresses might typically be more critical than ordinary data (though some ordinary data might be critical for branch misprediction detection). Storing critical portions of more threads closer to a core might be more desirable, though the difference in latency seems unlikely to be sigificant — this allocation could be managed somewhat dynamically. Such also runs afoul of migration costs as more distant context storage might reasonably closer to an alternative execution resource/core.

Variable utilization/thread context storage demand seems not to have been considered. There is a tradeoff between specialization (including proximity and tight coupling optimizations) and utilization. The bandwidth capacity of a specialized thread context storage area could be useful for L1 bypassing streaming accesses (that also have some blocking factor such that even L2 or L3 might also be bypassed or only used as a prefetch buffer).

There would also seem to be synergies with power saving and context storage; one Intel Atom implementation provided a special low static power context storage to allow fast restart — in theory such storage could be extended to support more general SoEMT (Atom's version could be viewed as switching between an idle thread with effectively no state and a stateful paused thread).

For "Non-register State" only TLBs and caches are considered. Branch prediction state could also be important. (Early branch prediction papers noted the problem of operating system access contaminating branch predictors; the proposed design seems likely to increase such conflict issues.) Per-address metadata might be cached with cached instructions (even out to L3 caches), and some threads might not use global history branch prediction.

Performance counters are another thread state component that is not mentioned. If performance counters are shared across permission domains for the same task (which would have some advantages for resource accounting), one might need extra care to avoid dangerous communication. Counters do have the advantage of only caring about the least significant bits at fine time granularity [overflow/static match checks can "pre-check" the more significant bits and cache the test result], so more significant bits might be stored farther from the heart of the core.

With virtually tagged caches and segment-oriented permissions, the TLB could be distanced from the core and its capacity more broadly shared. (An interesting possibility for Single Address Space OS hardware with share translations and separate permissions would be sharing a translation between isolated partial-page chunks and/or growing in opposite directions; the slightly larger page size advantage from such seems likely to be tiny, but the concept seems interesting enough to mention.) The metadata for a small number of segments might be more reasonable to cache closer to the core than the equivalent in PTEs.

The restriction to on-chip storage also misses the opportunity for coarser temporal granularity context switches. A thread waiting for a high latency event or an available time slice might be profitably "swapped" to main memory. If the minimum event latency is known, such might be used to prefetch the thread context to provide "zero" latency at the cost of early prefetching wasting on-chip storage capacity. For time slice thread activation, there may be less distinction between prefetching just in time and switching when data is available. Phase-based (whether boundaries are specified by software or detected by hardware or some combination) thread switching might be a little more challenging.

(Resource markets and task budgets might provide incentives for tasks to be nice [Unix pun intended]. A frugal task might be given more frequent time slices or favorable core choices simply because it has the budget to spend on such. Willingness to share a core [if the thread has, e.g., no private/secure data or low ILP benefit] might mean only paying 75% of cost of an isolated thread. A reduced-context thread might pay less than a full-context thread. Budget might also be reserved for critical activities, using less resources ordinarily [or when market demand is high] and increasing spending/resource use for critical portions. One might also imagine contracts, where a resource guarantee is provided [possibly with penalties/refunds for failures of softer guarantees] at a premium.)

I suspect for many of the handler-type threads, avoiding read-for-ownership might noticeably increase performance. I.e., less data persists across thread activations, so low-cost installation of cache blocks with new data could be especially beneficial.

"In our proposal, hardware threads use monitor - mwait to express blocking and wakeup conditions"

As already noted here, this places a constraint on the number of "hardware threads" as eviction from a cache of a monitor address (and associated thread ID) would result in that event never registering. The number of monitored addresses could be enormous with modest overhead, e.g., by using a tagging structure similar to a V-way cache (where more tags are available than cache blocks ["The V-Way Cache : Demand-Based Associativity via Global Replacement", Moinuddin K. Qureshi et al., 2005], which also fits with cache compression, including zero block tracking) with skewed associativity and cuckoo replacement. It would also be possible to "overflow" into an in-memory search structure with an on-chip conservative filter.

x86 mwait has the advantage of allowing false positives (end-wait) and relying on OS rescheduling to re-test the condition. Using the equivalent of a scanning garbage collector (with no-active-monitor "leaf" nodes not scanned) would seem to introduce excessive delay and all possible triggers would have to be cached for checking.

Explicit indication that an operation may/will/should trigger an event would allow indexing the overflow structure whenever a trigger does not match a cached entry, without the issues of a conservative filter (such as markers getting stuck on when a count overflows — though one might bias replacement to reduce the frequency of such and a GC-style scan of the overflow structure might be used to reset the filter with significant overheads).

"Hardware support will be needed for fine-grain tracking of threads’ resource consumption for cloud billing or software decisions"

This hints at the resource market/task budget concept mentioned above. This also implies more flexible allocation of threads to cores than proposed, though some important resources are more global.

(Tangential thought related to the global resource of shared but sliced L3: Memory addresses that are known to be constrained to a specific set of threads could be assigned to a thread-set home node (or collection of nodes) rather than using the physical address to assign a home node. Coherence traffic and access latency could be reduced by matching the home node with the placement of executing threads in the set. Full capacity use might be maintained by overflowing to the physical address home node, probably with tracking of whether overflow occurred (or using a finer-grained filter); a non-local thread might check both locations in parallel. IO coherence would be more complicated (perhaps IO MMU entries would include the home node). If the mapping of physical address to cache slice provided something like memory-side caching (i.e., proximity to the memory controller), or even a bias to such allocation, on chip network bandwidth use and some latency might be reduced. When (as would be common) the desired locations are other than the address-based or single thread-based home node, managing replication would seem more complex.)

"Part of scheduling, such as associating hardware threads with I/O events, could also be transparently offloaded to peripheral devices such as smartNICs"

This seems to be the only hint in the paper of the application to heterogeneous execution environments. Thread allocation and migration seems likely to benefit from substantial hardware involvement, similarly to power management benefiting from hardwrae involvement (due to temporal granularity/management overhead and microarchitectural specificity).

I do not want to be overly harsh. A Hot OS paper would reasonably focus more on USENIX-like references than references to computer architecture/hardware history. Since I am unfamiliar with most of the references, it seems more reasonable that the authors might be unfamiliar with various computer architecture concepts and references. I also realize that published papers have length limits (unlike technical reports)

Even so I am disappointed with no mention of ARM's FIRQ partial contexts, MIPS' Register Sets (for fast interrupts, these were overlaid with thread contexts in the MT-ASE), or Itanium's banked general registers (in the context of faster interrupts/exceptions). Mentioning other multithreading design points might have been nice, like Cray MTA and various SoEMT designs such as IBM's *star POWER implementations and MT Itaniums.

< Previous Post in ThreadNext Post in Thread >
TopicPosted ByDate
A Case Against (Most) Context SwitchesLittle Horn2021/05/17 05:03 PM
  A Case Against (Most) Context Switchesrwessel2021/05/17 06:55 PM
  A Case Against (Most) Context SwitchesFoo_2021/05/18 01:58 AM
    A Case Against (Most) Context SwitchesDoug S2021/05/18 08:45 AM
      A Case Against (Most) Context SwitchesKonrad Schwarz2021/05/19 07:35 AM
  A Case Against (Most) Context SwitchesEtienne Lorrain2021/05/18 03:11 AM
  A Case Against (Most) Context SwitchesAndrey2021/05/18 06:58 AM
  A Case Against (Most) Context Switchesgallier22021/05/18 08:41 AM
  A Case Against (Most) Context Switches---2021/05/18 09:00 AM
  A Case Against That Other PaperBrendan2021/05/18 12:37 PM
    A Case Against That Other PaperMark Roulo2021/05/18 03:32 PM
      A Case Against That Other PaperBrendan2021/05/18 11:05 PM
        A Case Against That Other PaperMark Roulo2021/05/19 01:09 PM
  A Case Against (Most) Context SwitchesRomain Dolbeau2021/05/19 04:05 AM
    A Case Against (Most) Context SwitchesBjörn Ragnar Björnsson2021/05/19 01:13 PM
      A Case Against ... authors show zero awareness of Cray-MTABjörn Ragnar Björnsson2021/05/19 06:18 PM
    Cray MTA avoided cachesPaul A. Clayton2021/05/20 06:36 AM
      Cray MTA avoided cachesdmcq2021/05/20 10:09 AM
        Cray MTA avoided cachesRayla2021/05/20 10:28 AM
      A LONG response to the paperPaul A. Clayton2021/05/22 06:15 AM
        A LONG response to the paperAdrian2021/05/22 09:18 AM
          Thank you for the note of appreciationPaul A. Clayton2021/05/24 05:06 AM
  A Case Against (Most) Context Switchesdmcq2021/05/19 01:47 PM
Reply to this Topic
Body: No Text
How do you spell tangerine? 🍊