Microkernel?

By: Kalle A. Sandström (spam.spam.beans.and.spam.ksandstr.delete@this.iki.fi), March 20, 2021 6:34 am
Room: Moderated Discussions
I wasn't going to wade into this discussion, but here we are. At least this isn't a shots-fired situation.

My tl;dr is, roughly, that the cost of IPC (consisting of de-/serialization, transfer, syscall, and context switch, times two for a roundtrip) can be brought low enough to not matter much in comparison to various other cost centers that show up in real-world benchmarks at that level, such as cache misses, cross-CPU interactions, and VMM overhead. Indeed real-world people with real-world use cases already don't give much of a crap about how much and what the Mach-era message-passing microkernels of Windows NT, MacOS X, DragonflyBSD, QNX, or (lord forbid) GNU/Hurd suck. Sadly or joyfully according to taste.

For the sake of having some proportion in the performance argument, with ten-year-old hardware (Sandy Bridge) a positive hash table lookup can cost as little as 80 cycles off a warm cache, i.e. on the second call. That's the price of looking up a per-client structure when a syscall is implemented as IPC to a server task, when no fancy "inject PID bits into a statically reserved address range" shenanigans are applied. Every core of a 2.8GHz CPU can do 35 million of those every second, or some 580k per frame.
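
The arithmetic behind those two figures can be checked with a few lines of Python. The 80-cycle cost and the 2.8GHz clock come from the paragraph above; the 60 frames per second rate is an assumption, picked because it reproduces the "some 580k" figure.

```python
# Back-of-envelope check of the lookup budget quoted above.
CLOCK_HZ = 2_800_000_000      # 2.8GHz core clock (from the text)
LOOKUP_CYCLES = 80            # warm-cache hash lookup cost (from the text)
FRAMES_PER_SECOND = 60        # assumption: a 60 Hz display frame

lookups_per_second = CLOCK_HZ // LOOKUP_CYCLES
lookups_per_frame = lookups_per_second // FRAMES_PER_SECOND

print(f"{lookups_per_second:,} lookups per second per core")
print(f"{lookups_per_frame:,} lookups per frame per core")
```

That comes to 35,000,000 lookups per second and 583,333 per frame, per core.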

But to rebuff the major microkernel wibble ITT, I'd point out that the "true" costs of IPC architecture are the consequences of separating the system into parts that then use an IPC mechanism to synchronize between themselves. Such as the scars of having any "system-space" APIs at all. This cost is not overcome by ideas about what amounts to the "kick the task's address space etc. and restart w/ fingers crossed" spinlock school of reliability, or woolly fanboy arguments regarding security (given that no workable model that applies to microkernels alone has appeared in the past 20 years) or modularity (when the lock-free/wait-free VM stack exists precisely nowhere, so there'll always be subtle interactions between tasks in software already).

As such I'll address only points concerning architectural downsides of IPC below. Irrelevant quotes redacted for brevity.

Linus Torvalds (torvalds.delete@this.linux-foundation.org) on March 16, 2021 7:25 pm wrote:
>
> Absolutely nobody sane disputes that the IPC overhead is a big and very real deal.
> Even the people who seem to be arguing for microkernels seem to be admitting it.
>

I plead quarter lunacy in that IPC overhead is a sub-proportionate effect of exploiting hardware features (to ends such as testability, implementor convenience, and ideas about write-once-run-forever) rather than avoiding their use. Whether that benchmarks acceptably is a research question; presumably nobody would mind if it did.

> It's something to generally be avoided, but sometimes you can't (or you don't care
> about the downsides, and have a load where it works for you). So if your sandboxing
> capabilities can't do it any other way, you fall back to some kind of IPC model.
>

Avoidance of IPC tends towards the "maximalist" microkernel operating system, where there's a monolithic kernel that runs as a big old specially-privileged userspace task with user processes around it, and the actual microkernel is reduced to a bunch of dead weight that does nothing except empty the fridge. This is in the name of reducing IPC overhead while delivering exactly no upside even on paper. That's the worst possible architecture, and I would assume it's not what's being discussed.

> It's not that some RPC-based model would always be a bad idea. Sometimes it's very much called for.
> Sometimes it's the only way to do it, and sometimes it's a really convenient model to take. But IPC
> isn't simple - particularly not some reasonably high-performance one - it adds a lot of complexity,
> and it often adds a lot of overhead unless your hardware is explicitly designed for it.
>

IPC is, in fact, hard -- but the problem is soluble, and the solution is justifiable by reduced latency between (say) a user program and the X server. The immediate overhead cost in terms of clock cycles can be negligible (depending on what one is willing to, uh, neglige), but the greater cost is reliance on a microkernel that (say) can't use a kernel stack per thread and therefore can only suspend in-kernel execution explicitly, which complicates e.g. interfaces that cannot indicate an OOM situation to userspace.

This question about "explicit detach" applies to all the monolithic-substituting services that a multiserver system would have, so the complexity and difficulty question does not end at the microkernel API.

> Designing your system as if RPC is the be-all and end-all of all you do, and what the whole
> system design is all about - that's just incredibly stupid.

The problem with doing "sometimes" IPC is that when there exists IPC gubbins that're fast enough, _not_ using IPC is a downside for everything except performance. And unless the appropriate tuning order is followed with the use of rigorous benchmarking, designing for performance is premature -- especially when performance is defined in terms of RDTSC.

> Because yes, the practical problems are huge and in many cases basically insurmountable. Your memory manager
> really does want to interact with your filesystems pretty much directly, and they want to share data structures
> so that they can do simple things like "I still use this page" (ie memory management uses it for mapping, filesystem
> uses it for reading or whatever) without sending a effing message to each other, for chrissake.
>

On the contrary, many of the practical problems can be surmounted with brute programmer effort. But that's a recipe for not crossing the finish line, so the essential point remains.

In the specific case of filesystems interacting with VM and block devices however, a design based on separation of concerns applies: the filesystem deals with on-disk metadata and inode lifecycle, VM deals with page caching and memory management, and block devices deal with the driver state machine i.e. the request queue, interrupts, and I/O commands. This boils down to a model where a user process does send a read(2) equivalent to a filesystem and wait for a reply (in the space of a single microkernel syscall, though that's irrelevant), but counterintuitively the filesystem then, having validated the client and file handle, translates the handle/range pair into an inode/range/block-address triple, hands them over to the VM task, and bows out of the transaction thereafter.

The VM task, in turn, either finds the affected pages in the cache and replies to the read() call directly, or spawns block device operations that read data right into the page cache and replies later, all without obligating more copies than the one to userspace.
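
As a rough illustration of that hand-off, here is a single-process Python model. It is a sketch under stated assumptions, not anyone's actual design: every class, field, and method name is made up for the example, function calls stand in for IPC, and a read is assumed not to cross a page boundary.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096


@dataclass
class VmRequest:
    """The inode/range/block-address triple handed from filesystem to VM."""
    inode: int
    offset: int
    length: int
    block_addr: int


class BlockDevice:
    def __init__(self, blocks):
        self.blocks = blocks      # block address -> page-sized bytes
        self.reads = 0            # counts device reads, for illustration

    def read_block(self, addr):
        self.reads += 1
        return self.blocks[addr]


class VmTask:
    def __init__(self, blockdev):
        self.blockdev = blockdev
        self.page_cache = {}      # (inode, page index) -> page bytes

    def handle_read(self, req):
        key = (req.inode, req.offset // PAGE_SIZE)
        page = self.page_cache.get(key)
        if page is None:
            # Cache miss: read straight into the page cache, reply afterwards.
            page = self.blockdev.read_block(req.block_addr)
            self.page_cache[key] = page
        start = req.offset % PAGE_SIZE
        # The slice stands in for the single copy out to the client.
        return page[start:start + req.length]


class Filesystem:
    def __init__(self, vm):
        self.vm = vm
        self.open_files = {}      # (client, handle) -> inode
        self.block_map = {}       # (inode, page index) -> block address

    def read(self, client, handle, offset, length):
        inode = self.open_files.get((client, handle))
        if inode is None:
            raise PermissionError("unknown client or file handle")
        block_addr = self.block_map[(inode, offset // PAGE_SIZE)]
        # Hand off and bow out: in a real system the VM task would reply to
        # the client itself; the return value here only models that reply.
        return self.vm.handle_read(VmRequest(inode, offset, length, block_addr))


disk = BlockDevice({7: bytes(range(256)) * 16})   # one 4096-byte block
vm = VmTask(disk)
fs = Filesystem(vm)
fs.open_files[(1, 3)] = 42        # client 1 holds handle 3 on inode 42
fs.block_map[(42, 0)] = 7         # inode 42, page 0 lives at block 7
first = fs.read(1, 3, 10, 4)      # miss: goes to the block device
second = fs.read(1, 3, 10, 4)     # hit: served from the page cache
print(first == second, disk.reads)
```

The second read never touches the block device, and the filesystem's part in either read ends at the translation step.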

That all being said, that "copy to userspace" is still a copy between two userspace processes and therefore either slow or a bit of a research project on its own. And this only amounts to fewer IPC transactions compared to a naïve model.

Rather, the speculated gain from filesystem-and-VM separation is related to the handling of metadata. As is well known, path resolution in POSIX may cross filesystem boundaries back and forth due to arbitrarily-complex symlinkage. This results in either a simplistic locking model that underperforms in a concurrent system, a massively complicated one with tricks and traps aplenty, or brain damage (I may be talking about soft updates here, but perhaps not). Microkernel organization where path resolution operates analogously through hand-off IPC is amenable to a fourth model utilizing a combination of transactional memory and distributed transaction brokering, thereby yielding an optimistically completing path resolution method that, while ugly in the source and incurring plenty of overhead even within a single filesystem task, requires only linear brute programmer effort compared to the exponential fuckery of "add more locks".
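
To make "optimistically completing" concrete, here is a toy Python sketch of the general read-validate-retry idea. It is not the draft design referred to here: it runs in one process, ignores symlinks and filesystem boundaries, and uses per-directory version counters as a stand-in for transactional memory. All names, including the on_step hook, are hypothetical.

```python
class Directory:
    def __init__(self):
        self.version = 0          # bumped on every modification
        self.entries = {}         # name -> Directory or leaf object

    def link(self, name, target):
        self.entries[name] = target
        self.version += 1


def resolve_once(root, path, on_step=None):
    """Walk the path without locks, recording each directory version seen."""
    read_set = []
    node = root
    for name in path.strip("/").split("/"):
        read_set.append((node, node.version))
        node = node.entries[name]
        if on_step is not None:
            on_step()             # hypothetical hook: lets a writer interleave
    return node, read_set


def resolve(root, path, max_attempts=8, on_step=None):
    """Retry until a walk validates; returns (result, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        node, read_set = resolve_once(root, path, on_step)
        # Commit check: every directory read must be unchanged since the walk.
        if all(d.version == seen for d, seen in read_set):
            return node, attempt
    raise RuntimeError("path resolution kept conflicting")
```

A walk that races with a modification to any directory it passed through fails validation and is simply redone, so no lock is held across the whole resolution. The cost is the bookkeeping on every step, which matches the "plenty of overhead" caveat above.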

Of course this exists only as a draft design in my little drawer of horrors, but the gist of it is sound. It's also testable on account of the transaction mechanism being testable where "lots of locks" will be full of surprises until validated through use, a property that resets to zero when modified.

> But latency can be a big deal in other situations - think a network driver or a block driver where we're talking
> below microsecond latencies on modern hardware.

Userspace I/O is something that monolithic kernels too must deal with eventually as DMA speeds approach and exceed 100 gig per second. If the solution doesn't boil down to "memory mapped buffers and a synchronization method", I will eat my hat (I don't wear a hat).
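
For a feel of what "memory mapped buffers and a synchronization method" can look like, here is a minimal single-producer, single-consumer ring laid out in an anonymous mapping. It is a layout sketch only, with assumed slot sizes and a made-up Ring class. Python gives no memory-ordering guarantees, so a real implementation would need release/acquire ordering on the indices and some way to block when the ring is empty or full.

```python
import mmap
import struct

SLOT_SIZE = 64                    # assumed fixed slot size in bytes
SLOT_COUNT = 8                    # assumed ring capacity
HEADER = struct.Struct("<QQ")     # head (consumer index), tail (producer index)
LENGTH = struct.Struct("<H")      # per-slot payload length prefix


class Ring:
    def __init__(self, buf):
        self.buf = buf            # shared mapping: header followed by slots

    def _slot(self, index):
        return HEADER.size + (index % SLOT_COUNT) * SLOT_SIZE

    def push(self, payload):
        if len(payload) > SLOT_SIZE - LENGTH.size:
            raise ValueError("payload too large for one slot")
        head, tail = HEADER.unpack_from(self.buf, 0)
        if tail - head == SLOT_COUNT:
            return False          # full: the consumer has to catch up first
        off = self._slot(tail)
        LENGTH.pack_into(self.buf, off, len(payload))
        start = off + LENGTH.size
        self.buf[start:start + len(payload)] = payload
        # Publish only after the payload is in place.
        struct.pack_into("<Q", self.buf, 8, tail + 1)
        return True

    def pop(self):
        head, tail = HEADER.unpack_from(self.buf, 0)
        if head == tail:
            return None           # empty
        off = self._slot(head)
        (length,) = LENGTH.unpack_from(self.buf, off)
        start = off + LENGTH.size
        data = bytes(self.buf[start:start + length])
        # Release the slot only after the payload has been copied out.
        struct.pack_into("<Q", self.buf, 0, head + 1)
        return data


ring = Ring(mmap.mmap(-1, HEADER.size + SLOT_COUNT * SLOT_SIZE))
ring.push(b"descriptor 0")
print(ring.pop())
```

The producer only ever writes the tail index and the consumer only ever writes the head index, which is what lets the two sides run without a lock between them.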

> All these things have tons of subtle interconnects that are not about "I'm sending a request to you". They
> are very much about co-operating with each other, and you want to have a shared request queue between the
> different pieces, you want to have visibility into (some of) the scheduler state, you want to have all these
> things where you just look at what is going on (and perhaps use an atomic op to change said state).
>

The question of sharing state between scheduler and polling driver comes down to the specified interface, whether it's possible to implement it in terms of the microkernel's primitives, and whether the solution breaks things elsewhere. Certainly the top-down hygiene dictated by academic microkernels is born of an era before such concerns.

> The filesystem code might also easily want to know whether the initiating
> user process might have a signal pending - maybe it can still cancel the
> whole thing if it can tell that the originating process is being killed.
>

The pending signal masks of all processes can be mapped into a concerned filesystem's address space at mount time and accessed at only slightly more TLB miss cost than in a monolithic kernel. I don't see how this would be useful in practice unless signal-provoked process death (or another form of cancellation) becomes part of (say) an elaborate cloud storage stack.

> So the whole belief that "user process sends a message to a filesystem process, and
> waits for the reply" is simply not true. Or rather, it's only true at such a high
> level that if you only see that part, you've lost sight of the underlying reality.
>

As beliefs go, it's not quite from the short-bus end of the pool. Hallucinations of the faithful aside.

> It was always insane to think that these pieces were unrelated and should be separate things,
> and only communicate with each other over some very limited and idealized channel.
>
> Why is this even a discussion any more? Microkernels failed.
> Give them up. You want a monolithic kernel. End of story.
>

Mach failed. Hurd is going nowhere fast. Windows NT, MacOS X, DragonflyBSD, and QNX persist. Microkernels in the 21st century are a topic of active experimental research, to be continued.

Even Mach's failure is debatable; it did spawn another generation's worth of research into not sucking like Mach did. Hardware has also changed from the '486 where performance was a matter of running the fewest instructions and pipeline massaging the ones that do get run.

> Yes, that monolithic kernel can then do RPC for the cases where that then makes sense.
>

But its mechanism will necessarily be worse due to the overhead of also implementing POSIX on the side, so in practice many features that it would make every kind of sense to implement as a distinct server in a multiserver microkernel design will appear as built-ins in the monolithic version.

> If you really live in that kind of world, where SMP and cache coherence failed and
> will never succeed, and machines instead have thousands of nodes that all communicate
> fundamentally using message passing, then microkernels might make sense.
>

This is a world where threads and forks (per MAP_SHARED) don't exist. The Transputer was structurally incapable of supporting all of POSIX, a curiosity of the 1980s like the rest of the single-language computers. Cache coherency is the hardware designer's solution to concurrent access, and it's efficient and in the right place.

That being said, as historical curiosities go, the stack machine computer that ran only Ada was, in all its single-processor glory, cool as heck.

> But this thread somehow devolved into then discussing whether IPC is crazy talk. No, obviously not. But IPC being
> a valid and sane thing to do does not equate to microkernels being a valid and sane thing to do. See?
>
> Linus

From a perspective where issues of developer scaling (as in, say, path resolution) are solved by directing a horde of volunteers to each take a crack at it until something workable emerges, I should say that achieving 1% of where Linux is today using less than 1% of the work would validate microkernels in the developer convenience department. As for sanity, well, there are those who put a saddle on that snake and yell "giddyup!".

t.
-KS