By: --- (---.delete@this.redheron.com), March 20, 2021 8:35 am
Room: Moderated Discussions
Kalle A. Sandström (spam.spam.beans.and.spam.ksandstr.delete@this.iki.fi) on March 20, 2021 6:34 am wrote:
> I wasn't going to wade into this discussion, but here we are. At least this isn't a shots-fired situation.
> My tl;dr is, roughly, that the cost of IPC (consisting of de-/serialization, transfer, syscall, and context
> switch, times two for a roundtrip) can be brought low enough to not matter much in comparison to various
> other cost centers that show up in real-world benchmarks at that level, such as cache misses, cross-CPU
> interactions, and VMM overhead. Indeed real-world people with real-world use cases already don't give
> much of a crap about how much and what the Mach-era message-passing microkernels of Windows NT, MacOS
> X, DragonflyBSD, QNX, or (lord forbid) GNU/Hurd suck. Sadly or joyfully according to taste.
> For the sake of having some proportion in the performance argument, with ten year old hardware (Sandy
> Bridge) a positive hash table lookup can cost as little as 80 cycles off a warm cache, i.e. on the second
> call. That's the price of looking up a per-client structure when a syscall is implemented as IPC to a
> server task, when no fancy "inject PID bits into a statically reserved address range" shenanigans are
> applied. Every core of a 2.8GHz CPU can do 35 million of those every second, or some 580k per frame.
> But to rebuff the major microkernel wibble ITT, I'd point out that the "true" costs of IPC architecture are
> the consequences of separating the system into parts that then use an IPC mechanism to synchronize between
> themselves. Such as the scars of having any "system-space" APIs at all. This cost is not overcome by ideas
> about what amounts to the "kick the task's address space etc. and restart w/ fingers crossed" spinlock school
> of reliability, or woolly fanboy arguments regarding security (given that no workable model that applies to
> microkernels alone has appeared in the past 20 years) or modularity (when the lock-free/wait-free VM stack
> exists precisely nowhere, so there'll always be subtle interactions between tasks in software already).
> As such I'll address only points concerning architectural downsides
> of IPC below. Irrelevant quotes redacted for brevity.
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on March 16, 2021 7:25 pm wrote:
> >
> > Absolutely nobody sane disputes that the IPC overhead is a big and very real deal.
> > Even the people who seem to be arguing for microkernels seem to be admitting it.
> >
> I plead quarter lunacy in that IPC overhead is a sub-proportionate effect of exploiting
> hardware features (to ends such as testability, implementor convenience, and ideas
> about write-once-run-forever) rather than avoiding their use. Whether that benchmarks
> acceptably is a research question; presumably nobody would mind if it did.
> > It's something to generally be avoided, but sometimes you can't (or you don't care
> > about the downsides, and have a load where it works for you). So if your sandboxing
> > capabilities can't do it any other way, you fall back to some kind of IPC model.
> >
> Avoidance of IPC tends towards the "maximalist" microkernel operating system, where a monolithic
> kernel runs as a big old specially-privileged userspace task with user processes around it, and
> the actual microkernel amounts to a bunch of dead weight that does nothing except empty the
> fridge. This is in the name of reducing IPC overhead while delivering exactly no upside even on paper.
> That's the worst possible architecture, and I would assume it's not what's being discussed.
> > It's not that some RPC-based model would always be a bad idea. Sometimes it's very much called for.
> > Sometimes it's the only way to do it, and sometimes it's a really convenient model to take. But IPC
> > isn't simple - particularly not some reasonably high-performance one - it adds a lot of complexity,
> > and it often adds a lot of overhead unless your hardware is explicitly designed for it.
> >
> IPC is, in fact, hard -- but the problem is soluble, and the solution is justifiable by reduced latency
> between (say) a user program and the X server. The immediate overhead cost in terms of clock cycles can
> be negligible (depending on what one is willing to, uh, neglige), but the greater cost is reliance on a
> microkernel that (say) can't use a kernel stack per thread and therefore can only suspend in-kernel execution
> explicitly, which complicates e.g. interfaces that cannot indicate an OOM situation to userspace.
> This question about "explicit detach" applies to all the monolithic-substituting services that a multiserver
> system would have, so the complexity and difficulty question does not end at the microkernel API.
> > Designing your system as if RPC is the be-all and end-all of all you do, and what the whole
> > system design is all about - that's just incredibly stupid.
> The problem with doing "sometimes" IPC is that when there exists IPC gubbins that're fast
> enough, _not_ using IPC is a downside for everything except performance. And unless the
> appropriate tuning order is followed with the use of rigorous benchmarking, designing for
> performance is premature -- especially when performance is defined in terms of RDTSC.
> > Because yes, the practical problems are huge and in many
> > cases basically insurmountable. Your memory manager
> > really does want to interact with your filesystems pretty
> > much directly, and they want to share data structures
> > so that they can do simple things like "I still use this page"
> > (ie memory management uses it for mapping, filesystem
> > uses it for reading or whatever) without sending an effing message to each other, for chrissake.
> >
> On the contrary, many of the practical problems can be surmounted with brute programmer effort.
> But that's a recipe for not crossing the finish line, so the essential point remains.
> In the specific case of filesystems interacting with VM and block devices however, a design based
> on separation of concerns applies: the filesystem deals with on-disk metadata and inode lifecycle,
> VM deals with page caching and memory management, and block devices deal with the driver state machine
> i.e. the request queue, interrupts, and I/O commands. This boils down to a model where a user process
> does send a read(2) equivalent to a filesystem and wait for a reply (in the space of a single microkernel
> syscall, though that's irrelevant), but counterintuitively the filesystem then, having validated the
> client and file handle, translates the handle/range pair into an inode/range/block-address triple,
> hands them over to the VM task, and bows out of the transaction thereafter.
> The VM task respectively either finds the affected pages in the cache and replies to the read()
> call directly, or spawns block device operations to the effect of reading data right into the
> page cache and replies later, all without obligating more copies than the one to userspace.
> That all being said, that "copy to userspace" is still a copy between two userspace
> processes and therefore either slow or a bit of a research project on its own. And
> this only amounts to fewer IPC transactions compared to a naïve model.

"Copy to userspace" is still expensive (for now...) but there are already interesting things that can be done to speed it up, along with the other aspects of the problem.

Three representative Apple patents:

This one attaches an ID (they call it a threadID, but many options are possible, including e.g. user/supervisor status) to registers. The idea is that
- every physical register (in principle only those that are backing a visible logical register; in practice I assume all of them) has a "threadID"
- the "CPU", on context switch, is given an "old register block address" and a "new register block address"
No registers are saved.
As the CPU executes in the new context, it may try to overwrite registers associated with the old ID; these are then written out (i.e. on demand, one at a time, not in a thundering herd) to the old register block address. Likewise, as registers associated with the new context are read, they are pulled in, again on demand, from the new register block address.

One can imagine many flaws in this as described, but even as described it handles the particular case of A to B back to A (e.g. a "simple" system call, or simple IPC to a worker task); the more complicated cases of A to B to god knows where will obviously require either a forced save of what's not yet saved or additional machinery (like being able to store more than two address blocks attached to more than two threadIDs).

This is representative of a very clever pattern seen in a lot of Apple patents, but which I see rarely in textbooks or the literature. The common idea is to split a task into its minimal elements, perform those elements that are safe to perform in spite of ordering constraints, and subject only the minimal remainder of the task to the ordering constraints.
In this case the job of DCBZ is split into two elements: an "inform other CPUs about the snoop changes" step (which is speculatively safe) and the actual "line overwrite", which has to be delayed until the operation is non-speculative.

A similar version of this idea is seen in some Apple load/store patents. As soon as the address is translated without fault, appropriate prefetch and snoop ownership signals are sent out, although the final data load (or store to cache) is of course delayed as required by ordering. But the initial prefetch means that frequently much of the waiting in the load queue for the ordering to resolve can occur while the prefetch is pulling the data to L1D level.

And of course Apple have a hardware compression engine (of course they do! in addition to specialized instructions for that task on the CPU).
The relevance is that once you have an async engine like this, it's a trivial matter to modify it for other bulk manipulations of interest (crypto, of course, but also basic async "copy 16kB page A to page B"). It won't speed up the copy itself, but it can allow the copy to be done without blowing the caches, and overlapped with whatever else the OS can find to do before it goes to sleep awaiting a Wait for Event wakeup from the async engine.