Microkernel?

By: anon2 (anon.delete@this.anon.com), March 21, 2021 5:29 pm
Room: Moderated Discussions
Kalle A. Sandström (spam.spam.beans.and.spam.ksandstr.delete@this.iki.fi) on March 20, 2021 6:34 am wrote:
> I wasn't going to wade into this discussion, but here we are. At least this isn't a shots-fired situation.

What is a shots-fired situation?

>
> My tl;dr is, roughly, that the cost of IPC (consisting of de-/serialization, transfer, syscall, and context
> switch, times two for a roundtrip) can be brought low enough to not matter much in comparison to various
> other cost centers that show up in real-world benchmarks at that level, such as cache misses, cross-CPU
> interactions, and VMM overhead.

This isn't really true for many kinds of interactions between subsystems that you would like to have. People claim to solve it by hacking out all memory protection and privilege domains between the subsystems, and then hacking the message passing until it works as close to a function call passing arguments in shared memory as possible.

Even then it's slower and more cumbersome, so they set about jumping through hoops to avoid what would otherwise be very natural calls between subsystems: buffering things up that don't need to be buffered, deferring work, caching results, over-provisioning, and so on, all of which conspires to add complexity and waste cache footprint.
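To make that concrete, the end state of that hacking usually looks something like the sketch below. This is hypothetical on my part: the shared page layout, OP_READ and sys_call_server() are invented for illustration, not taken from any real kernel. The client marshals its arguments into a page shared with the server, then makes one combined send-and-wait call that switches straight to the server thread and back.

    /* Hypothetical sketch: "IPC reduced to a function call" over a page
     * of memory shared between client and server.  struct msg_page,
     * OP_READ and sys_call_server() are invented for illustration.     */
    #include <stdint.h>

    #define OP_READ 1

    struct msg_page {                  /* one page, mapped into both tasks */
        uint32_t opcode;
        uint32_t fd;
        uint64_t len;
        uint64_t off;
        char     payload[4096 - 24];
    };

    /* Combined send + receive: donate the time slice to the server and
     * block until it writes a reply into the same page.                */
    extern long sys_call_server(int server_id, struct msg_page *page);

    static long fs_read(int fs_server, struct msg_page *shared,
                        int fd, uint64_t len, uint64_t off)
    {
        shared->opcode = OP_READ;      /* "argument passing" is just stores */
        shared->fd  = (uint32_t)fd;
        shared->len = len;
        shared->off = off;
        return sys_call_server(fs_server, shared);
    }

Even in this best case you still pay two privilege transitions and a context switch each way, which is exactly the point.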

> Indeed real-world people with real-world use cases already don't give
> much of a crap about how much and what the Mach-era message-passing microkernels of Windows NT, MacOS
> X, DragonflyBSD, QNX, or (lord forbid) GNU/Hurd suck. Sadly or joyfully according to taste.

Which real-world people and real-world use cases?

I would venture that orders of magnitude more real-world people and use cases actually care about NT, OS X, and possibly QNX than about the alleged real-world people who have their own far better microkernels.

NT and OS X perform quite well thanks to pulling vast amounts of monolithic code (network stacks, drivers, etc.) into their kernels. Also, DragonflyBSD is not a microkernel.

http://lists.dragonflybsd.org/pipermail/kernel/2003-October/273060.html

"I see no advantage at all in trying to convert the system wholely to a microkernel design, other then to slow it down and make the source code harder to understand :-)"

>
> For the sake of having some proportion in the performance argument, with ten year old hardware (Sandy
> Bridge) a positive hash table lookup can cost as little as 80 cycles off a warm cache, i.e. on the second
> call. That's the price of looking up a per-client structure when a syscall is implemented as IPC to a
> server task, when no fancy "inject PID bits into a statically reserved address range" shenanigans are
> applied. Every core of a 2.8GHz CPU can do 35 million of those every second, or some 580k per frame.

That's not *the* price, that is *part* of the price. Waiting for the server task to become available and accept your request is a whole other side of the problem. I'll also note that an entire end-to-end NULL system call on a monolithic kernel is on the order of 100-200 cycles, so that 80 cycles is quite a lot.

But system calls are not the worst of it. You are already changing privilege and taking a significant cost there. If you then have to go through a bunch more of these RPC calls in the kernel to do any useful work, things really start to hurt.
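For a rough sense of scale, here is the kind of null-syscall microbenchmark those 100-200 cycle numbers come from. This is a sketch assuming Linux; the result varies a lot with the CPU, the kernel version and the speculation mitigations, so treat it as an order of magnitude only.

    /* Rough null-syscall timing sketch (Linux).  syscall(SYS_getpid)
     * does essentially no work in the kernel, so this measures the
     * entry/exit path.  Results vary a lot with the CPU and with the
     * speculation mitigations; treat it as an order of magnitude.      */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void)
    {
        enum { N = 1000000 };
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++)
            syscall(SYS_getpid);       /* a syscall that does almost nothing */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per null syscall\n", ns / N);
        return 0;
    }

Multiply the nanosecond figure by the clock frequency to get cycles, and remember this is the whole round trip: entry, a trivial amount of kernel work, and exit.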

>
> But to rebuff the major microkernel wibble ITT, I'd point out that the "true" costs of IPC architecture are
> the consequences of separating the system into parts that then use an IPC mechanism to synchronize between
> themselves. Such as the scars of having any "system-space" APIs at all. This cost is not overcome by ideas
> about what amounts to the "kick the task's address space etc. and restart w/ fingers crossed" spinlock school
> of reliability, or woolly fanboy arguments regarding security (given that no workable model that applies to
> microkernels alone has appeared in the past 20 years) or modularity (when the lock-free/wait-free VM stack
> exists precisely nowhere, so there'll always be subtle interactions between tasks in software already).
>
> As such I'll address only points concerning architectural downsides
> of IPC below. Irrelevant quotes redacted for brevity.
>
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on March 16, 2021 7:25 pm wrote:
> >
> > Absolutely nobody sane disputes that the IPC overhead is a big and very real deal.
> > Even the people who seem to be arguing for microkernels seem to be admitting it.
> >
>
> I plead quarter lunacy in that IPC overhead is a sub-proportionate effect of exploiting
> hardware features (to ends such as testability, implementor convenience, and ideas
> about write-once-run-forever) rather than avoiding their use. Whether that benchmarks
> acceptably is a research question; presumably nobody would mind if it did.
>
> > It's something to generally be avoided, but sometimes you can't (or you don't care
> > about the downsides, and have a load where it works for you). So if your sandboxing
> > capabilities can't do it any other way, you fall back to some kind of IPC model.
> >
>
> Avoidance of IPC tends towards the "maximalist" microkernel operating system, where there's a monolithic
> kernel that runs as a big old specially-privileged userspace task and user processes around that, and
> the actual microkernel's job being that of a bunch of dead weight that does nothing except empty the
> fridge. This is in the name of reducing IPC overhead while delivering exactly no upside even on paper.
> That's the worst possible architecture, and I would assume it's not what's being discussed.
>
> > It's not that some RPC-based model would always be a bad idea. Sometimes it's very much called for.
> > Sometimes it's the only way to do it, and sometimes it's a really convenient model to take. But IPC
> > isn't simple - particularly not some reasonably high-performance one - it adds a lot of complexity,
> > and it often adds a lot of overhead unless your hardware is explicitly designed for it.
> >
>
> IPC is, in fact, hard -- but the problem is soluble, and the solution is justifiable by reduced latency
> between (say) an user program and the X server.

I disagree. Not even the immediate overhead of clock cycles and cache unfriendliness is negligible.

> The immediate overhead cost in terms of clock cycles can
> be negligible (depending on what one is willing to, uh, neglige), but the greater cost is reliance on a
> microkernel that (say) can't use a kernel stack per thread and therefore can only suspend in-kernel execution
> explicitly, which complicates e.g. interfaces that cannot indicate an OOM situation to userspace.

This of course is the other huge problem that microkernels have. The immediate cost of IPC and context switching is one thing; managing and scheduling all of these contexts is a whole different problem, and it by no means becomes easier with many CPUs in the system, when you also have to provision enough contexts to run performance-critical services local to the CPU where the requests come from. Even scheduling today's userspace applications on a monolithic kernel on modern multicore systems is a beast of a problem.
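To make the provisioning point concrete, every server that wants requests handled on the CPU they arrive on ends up doing something like the following (a sketch assuming Linux and pthreads; serve_requests_on_this_cpu() is a stand-in for the real IPC receive loop), and the system pays for those idle stacks and contexts once per server, not once overall.

    /* Sketch: a server pre-provisioning one pinned worker thread per
     * CPU so requests can be handled on the core they arrive on.
     * serve_requests_on_this_cpu() stands in for the real receive loop. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void serve_requests_on_this_cpu(void)
    {
        for (;;)
            pause();                   /* placeholder for the IPC loop   */
    }

    static void *worker(void *arg)
    {
        long cpu = (long)arg;
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof set, &set);

        serve_requests_on_this_cpu();
        return NULL;
    }

    static void spawn_per_cpu_workers(void)
    {
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

        for (long cpu = 0; cpu < ncpus; cpu++) {
            pthread_t tid;
            if (pthread_create(&tid, NULL, worker, (void *)cpu) != 0)
                abort();
        }
    }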

>
> This question about "explicit detach" applies to all the monolithic-substituting services that a multiserver
> system would have, so the complexity and difficulty question does not end at the microkernel API.
>
> > Designing your system as if RPC is the be-all and end-all of all you do, and what the whole
> > system design is all about - that's just incredibly stupid.
>
> The problem with doing "sometimes" IPC is that when there exists IPC gubbins that're fast
> enough, _not_ using IPC is a downside for everything except performance.

I don't know what you're trying to say.

> And unless the
> appropriate tuning order is followed with the use of rigorous benchmarking, designing for
> performance is premature -- especially when performance is defined in terms of RDTSC.

Performance is defined in terms of end-user workload throughput, and on that measure microkernels do not look nearly as pretty as their hand-picked "RDTSC" best-case IPC or context-switching microbenchmarks would suggest.

>
> > Because yes, the practical problems are huge and in many
> > cases basically insurmountable. Your memory manager
> > really does want to interact with your filesystems pretty
> > much directly, and they want to share data structures
> > so that they can do simple things like "I still use this page"
> > (ie memory management uses it for mapping, filesystem
> > uses it for reading or whatever) without sending a effing message to each other, for chrissake.
> >
>
> On the contrary, many of the practical problems can be surmounted with brute programmer effort.

Citation required.

> But that's a recipe for not crossing the finish line, so the essential point remains.
>
> In the specific case of filesystems interacting with VM and block devices however, a design based
> on separation of concerns applies: the filesystem deals with on-disk metadata and inode lifecycle,
> VM deals with page caching and memory management, and block devices deal with the driver state machine
> i.e. the request queue, interrupts, and I/O commands. This boils down to a model where an user process
> does send a read(2) equivalent to a filesystem and wait for a reply (in the space of a single microkernel
> syscall, though that's irrelevant), but counterintuitively the filesystem then, having validated the
> client and file handle, translates the handle/range pair into a inode/range/block-address triple,
> hands them over to the VM task, and bows out of the transaction thereafter.
>
> The VM task respectively either finds the affected pages in the cache and replies the read()
> call directly, or spawns block device operations to the effect of reading data right into the
> page cache and replies later, all without obligating more copies than the one to userspace.

The problem is not the one single copy to userspace in the system call. Of course a microkernel is not going to do more than a monolithic kernel across the system call boundary.

The problem is the interaction between those pieces. And no, it's not as simple as you say. If the page cache page does exist, the filesystem may still need to be called in order to validate it, or to do a delayed allocation for it. If it does not exist, things are even worse: the filesystem will require multiple calls to and from the memory manager and the block device to allocate new cache pages, read in its data structures, and add things to the page cache.
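As a sketch of the kind of ping-ponging that implies, here is a read() miss seen from the filesystem server's side, written only to count the cross-server round trips. The server ids, message layout and ipc_call() primitive are invented for illustration; this is not anyone's actual design.

    /* Hypothetical sketch of a read() miss seen from the filesystem
     * server, written only to count cross-server round trips.  The
     * server ids, message layout and ipc_call() are all invented.     */
    struct msg { int op; long ino, off, len, blk; void *page; };

    extern long ipc_call(int server, struct msg *m);   /* hypothetical */

    enum { VM_SERVER = 1, BLK_SERVER = 2 };
    enum { VM_LOOKUP, VM_ALLOC_PAGE, VM_INSERT_PAGE, BLK_READ };

    static long fs_handle_read(long ino, long off, long len)
    {
        struct msg m = { .ino = ino, .off = off, .len = len };

        m.op = VM_LOOKUP;              /* round trip 1: is it cached?    */
        if (ipc_call(VM_SERVER, &m) == 0)
            return len;                /* hit: still cost one round trip */

        m.op = VM_ALLOC_PAGE;          /* round trip 2: get a cache page */
        ipc_call(VM_SERVER, &m);

        /* Translating off to a disk block may itself need further
         * BLK_READs for on-disk metadata (indirect blocks, extents).   */
        m.op  = BLK_READ;              /* round trip 3, or more          */
        m.blk = 0;                     /* stand-in for the real block    */
        ipc_call(BLK_SERVER, &m);

        m.op = VM_INSERT_PAGE;         /* round trip 4: publish the page */
        ipc_call(VM_SERVER, &m);

        return len;                    /* plus the client's own trip     */
    }

Every one of those ipc_call()s is two privilege transitions and two context switches, and that's on top of the client's own round trip into the filesystem server.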

>
> That all being said, that "copy to userspace" is still a copy between two userspace
> processes and therefore either slow or a bit of a research project on its own. And
> this only amounts to fewer IPC transactions compared to a naïve model.
>
> Rather, the speculated gain from filesystem-and-VM separation is related to the handling of metadata. As
> is well known, path resolution in POSIX may cross filesystem boundaries back and forth due to arbitrarily-complex
> symlinkage. This results in either a simplistic locking model that underperforms in a concurrent system,
> a massively complicated one with tricks and traps aplenty, or brain damage (I may be talking about soft
> updates here, but perhaps not). Microkernel organization where path resolution operates analoguously through
> hand-off IPC is amenable to a fourth model utilizing a combination of transactional memory and distributed
> transaction brokering; thereby yielding an optimistically completing path resolution method that, while
> ugly in the source and incurring plenty of overhead even within a single filesystem task, requires only
> linear brute programmer effort compared to the exponential fuckery of "add more locks".
>
> Of course this exists only as a draft design in my little drawer of horrors, but the gist of it is sound.
> It's also testable on account of the transaction mechanism being testable where "lots of locks" will
> be full of surprises until validated through use, a property that resets to zero when modified.

This is all entirely handwavy. I find it strange that you can be so sure of yourself that you reply to Linus on this kind of topic saying that he is wrong and you have solved it, despite there being no real-world evidence of any microkernel in existence that has actually solved it.

None of what you wrote constitutes any evidence or even a counter-example at all, really. You enumerate some vague problems and handwave possible solutions to them, and imply that these are the sum of what is required to address all concerns with microkernel performance in the mm/filesystem/block layer, which nobody else has ever solved.

>
> > But latency can be a big deal in other situations - think
> > a network driver or a block driver where we're talking
> > below microsecond latencies on modern hardware.
>
> Userspace I/O is something that monolithic kernels too must deal with eventually as DMA speeds
> approach and exceed 100 gig per second. If the solution doesn't boil down to "memory mapped
> buffers and a synchronization method", I will eat my hat (I don't wear a hat).

Avoiding the kernel always has been and always will be attractive for the very highest-performance devices. I'll tell you what the actual performant model does *not* look anything like, though: a message-passing microkernel.
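For what it's worth, the "memory mapped buffers and a synchronization method" half of that is real enough: kernel-bypass stacks do boil down to something like a single-producer/single-consumer descriptor ring in shared memory. The sketch below is generic, with an invented descriptor layout; io_uring, NIC rings and DPDK all differ in detail but have the same basic shape. Note what it is not: a message-passing microkernel IPC path.

    /* Generic sketch of a single-producer/single-consumer descriptor
     * ring in shared, memory-mapped memory.  The descriptor layout and
     * ring size are invented; real stacks differ in detail.            */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define RING_SIZE 256                      /* power of two          */

    struct desc { uint64_t buf_addr; uint32_t len; uint32_t flags; };

    struct ring {
        _Atomic uint32_t head;                 /* written by producer   */
        _Atomic uint32_t tail;                 /* written by consumer   */
        struct desc slots[RING_SIZE];
    };

    /* Producer: publish one descriptor; false if the ring is full.     */
    static bool ring_push(struct ring *r, struct desc d)
    {
        uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

        if (head - tail == RING_SIZE)
            return false;                      /* full                  */
        r->slots[head % RING_SIZE] = d;
        /* release: descriptor contents become visible before head does */
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    /* Consumer: take one descriptor; false if the ring is empty.       */
    static bool ring_pop(struct ring *r, struct desc *out)
    {
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

        if (tail == head)
            return false;                      /* empty                 */
        *out = r->slots[tail % RING_SIZE];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return true;
    }

The part this leaves out is the wakeup/doorbell mechanism for when one side goes idle, which is where the interesting design choices actually are.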

>
> > All these things have tons of subtle interconnects that
> > are not about "I'm sending a request to you". They
> > are very much about co-operating with each other, and you want to have a shared request queue between the
> > different pieces, you want to have visibility into (some
> > of) the scheduler state, you want to have all these
> > things where you just look at what is going on (and perhaps use an atomic op to change said state).
> >
>
> The question of sharing state between scheduler and polling driver comes down to the
> specified interface, whether it's possible to implement it in terms of the microkernel's
> primitives, and whether the solution breaks things elsewhere. Certainly the top-down
> hygiene dictated by academic microkernels is born of an era before such concerns.
>
> > The filesystem code might also easily want to know whether the initiating
> > user process might have a signal pending - maybe it can still cancel the
> > whole thing if it can tell that the originating process is being killed.
> >
>
> The pending signal masks of all processes can be mapped into a concerned filesystem's address
> space at mount time and accessed at only slightly more TLB miss cost than in a monolithic kernel.
> I don't see how this would be useful in practice unless signal-provoked process death (or another
> form of cancellation) becomes part of (say) an elaborate cloud storage stack.
>
> > So the whole belief that "user process sends a message to a filesystem process, and
> > waits for the reply" is simply not true. Or rather, it's only true at such a high
> > level that if you only see that part, you've lost sight of the underlying reality.
> >
>
> As beliefs go, it's not quite from the short-bus end of the pool. Hallucinations of the faithful aside.
>
> > It was always insane to think that these pieces were unrelated and should be separate things,
> > and only communicate with each other over some very limited and idealized channel.
> >
> > Why is this even a discussion any more? Microkernels failed.
> > Give them up. You want a monolithic kernel. End of story.
> >
>
> Mach failed. Hurd is going nowhere fast. Windows NT, MacOS X, DragonflyBSD, and QNX persist. Microkernels
> in the 21st century are a topic of active experimental research, to be continued.
>
> Even Mach's failure is debatable; it did spawn another generation's worth of research into
> not sucking like Mach did. Hardware has also changed from the '486 where performance was a
> matter of running the fewest instructions and pipeline massaging the ones that do get run.

It's not debatable. You can learn from failures just fine, but Mach was absolutely intended to be a competitive production operating system kernel, and it is not.

>
> > Yes, that monolithic kernel can then do RPC for the cases where that then makes sense.
> >
>
> But its mechanism will be necessarily worse due to the overhead of also implementing POSIX on the side,
> so that in practice many cases where in a multiserver microkernel design it makes every kind of sense
> to implement a feature in a distinct server will appear as built-ins in the monolithic version.
>
> > If you really live in that kind of world, where SMP and cache coherence failed and
> > will never succeed, and machines instead have thousands of nodes that all communicate
> > fundamentally using message passing, then microkernels might make sense.
> >
>
> This is a world where threads and forks (per MAP_SHARED) don't exist. The Transputer
> was structurally incapable of supporting all of POSIX, a curiosity of the 1980s like
> the rest of the single-language computers. Cache coherency is the hardware designer's
> solution to concurrent access, and it's efficient and in the right place.
>
> That being said, as historical curiosities go, the stack machine computer
> that ran only Ada was, in all its single-processor glory, cool as heck.
>
> > But this thread somehow devolved into then discussing whether
> > IPC is crazy talk. No, obviously not. But IPC being
> > a valid and sane thing to do does not equate to microkernels being a valid and sane thing to do. See?
> >
> > Linus
>
> From a perspective where issues of developer scaling (as in, say, path resolution) are solved by directing a
> horde of volunteers to each take a crack at it until something workable emerges, I should say that achieving
> 1% of where Linux is today using less than 1% the work would validate microkernels in the developer convenience
> department. As for sanity, well, there are those who put a saddle on that snake and yell "giddyup!".

Doing it as a hobby is great; microkernels are very interesting, fun, and different, and a whole world of problems stretches out before you when you look at them. But don't fool yourself or others into thinking you have it all solved (or that what you can't solve doesn't really matter). That is almost certainly not the case. Many people smarter than you or I haven't managed to strap enough rockets onto that pig.

You might get upset by Linus' "shots fired" way of dismissing microkernels. I would suggest not letting it bother you too much, because at this point he kind of gets to. And replying with handwaving rather than cold hard results comes across like just another of the microkernel quacks of the past 30 years who have come and gone, insisting that Linus was wrong and they were right. The only thing that will have any credibility in proving the point is code, and results.