By: Linus Torvalds (torvalds.delete@this.osdl.org), May 15, 2006 9:49 am
Room: Moderated Discussions
Brendan (btrotter@gmail.com) on 5/15/06 wrote:
>
>I _do_ understand this, I just don't see how separating
>everything so that only one process has access to each
>"thing" would make scalability suck or make me 10 years
>behind.

Separating some things is a wonderful thing, and not ten
years behind. Almost all lockless algorithms depend on
per-cpu data structures, and that's fine.
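
As an illustrative sketch of that per-cpu idea (plain C11 userspace
toy code, not the kernel's actual percpu machinery), a counter can be
split so the update path never takes a shared lock:

#include <stdatomic.h>

#define NR_CPUS   64
#define CACHELINE 64

/* One slot per CPU, each on its own cache line, so updates from
 * different CPUs never fight over the same line. */
struct percpu_slot {
    _Alignas(CACHELINE) atomic_long count;
};

struct percpu_counter {
    struct percpu_slot slot[NR_CPUS];
};

/* Fast path: the running CPU touches only its own slot, no shared lock. */
static inline void pc_add(struct percpu_counter *pc, int cpu, long n)
{
    atomic_fetch_add_explicit(&pc->slot[cpu].count, n, memory_order_relaxed);
}

/* Slow path: a reader sums every slot; the result is approximate while
 * other CPUs keep updating, which is fine for statistics. */
static long pc_read(struct percpu_counter *pc)
{
    long sum = 0;
    for (int i = 0; i < NR_CPUS; i++)
        sum += atomic_load_explicit(&pc->slot[i].count, memory_order_relaxed);
    return sum;
}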

The problem is doing it for "everything". It doesn't work.

Doing it for the filesystem caches, for example, is a total
disaster. It increases memory pressure by huge amounts, and
makes synchronization much harder.

The synchronization issue isn't necessarily a huge deal
for regular file data (writes are much less common than
reads, and synchronization rules are fairly lax), but that
is not true for most filesystem metadata, for example.

In particular, under many loads, the bulk of the cached
filesystem data ends up being not the file data itself, but
the directory and inode information. And that also tends
to be the hottest part.

You may have thought that "read()" was performance
sensitive? Nope. Doing a "stat()" is under many loads the
much more performance-sensitive operation, and there the
name and inode lookup is the biggest deal, and there the
synchronization requirements are also a lot more strict
(ie you simply cannot allow two processes to create
the same filename. That would be instant bad OS karma).
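
To make that concrete (an invented toy sketch in userspace C, not real
VFS code), creating a name has to do the lookup and the insert as one
step under the parent directory's lock, or two racing creators could
both "succeed":

#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct entry {
    char          name[256];
    struct entry *next;
};

struct dir {
    pthread_mutex_t lock;     /* serializes lookup + create in this directory */
    struct entry   *entries;  /* cached entries for this directory */
};

int dir_create(struct dir *d, const char *name)
{
    struct entry *e;
    int err = 0;

    pthread_mutex_lock(&d->lock);
    /* Lookup and insert must be a single atomic step under the lock. */
    for (e = d->entries; e; e = e->next) {
        if (strcmp(e->name, name) == 0) {
            err = -EEXIST;        /* someone else created it first */
            goto out;
        }
    }
    e = calloc(1, sizeof(*e));
    if (!e) {
        err = -ENOMEM;
        goto out;
    }
    strncpy(e->name, name, sizeof(e->name) - 1);
    e->next = d->entries;
    d->entries = e;
out:
    pthread_mutex_unlock(&d->lock);
    return err;
}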

So replication simply isn't a generally acceptable model.
It works fine sometimes, and when it works, it's
wonderful, and monolithic kernels will certainly do it too.
It's just not a "generic" solution.

In many other cases, replication is absolutely the last
thing you'd ever want to do, either because it doesn't
buy you anything (read-only data caches wonderfully), or
because the modification costs go up way too much. If you
write to it even occasionally, the cost of spreading out
the writes can be absolutely prohibitive.

>Cacheline contention caused by the message queue lock in
>a micro-kernel will be as bad as cacheline contention
>caused by a device driver's own lock in a system where
>the device driver runs in the context of some other
>process, so swapping one for the other doesn't seem to
>make any difference to me.

Generally, you want to try very hard to run all queuing on the
same CPU that the service end result is needed on. The
lock ends up fairly expensive (and forget about contention:
contention doesn't really exist much in real loads for
good locking).

The expense comes from just the CPU having to serialize
enough for the locking to actually be effective. But at
least you can try to avoid having the cacheline move away,
either for the lock, or for the actual data being modified
(notably the pointers for the queues).
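
One illustration of that (a toy layout for this discussion, not taken
from any real driver): keep the lock word and the queue pointers it
protects on a single cache line, and keep everything else off that
line, so that when the queue does have to move to another CPU only
one line moves:

#include <pthread.h>
#include <stdalign.h>

struct msg;

struct msg_queue {
    /* The lock plus the head/tail pointers it protects, packed into
     * one 64-byte line: a handoff to another CPU drags this single
     * line across instead of bouncing several. */
    alignas(64) struct {
        pthread_spinlock_t lock;
        struct msg *head;
        struct msg *tail;
    } hot;

    /* Rarely-written bookkeeping kept on its own line so it doesn't
     * add false sharing to the hot path. */
    alignas(64) unsigned long enqueued_total;
};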

That said, you'll get cache bouncing any time you have
multiple processes touching the same thing; that just isn't
avoidable. The best you can try to do is to have CPU
affinity, and make sure that as much as possible of the
setup is run in the same context (ie if you absolutely have
to have a "service thread", make sure that it is run on the
same CPU, and preferably with the same TLB as the process
it serves).
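
A hedged sketch of that affinity advice, using the real glibc calls
sched_getcpu() and pthread_setaffinity_np() (Linux-specific, minimal
error handling): pin the service thread to whatever CPU the client is
currently on, so the queue's cache lines stay local:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void *service_thread(void *arg)
{
    (void)arg;
    /* ... drain the request queue here ... */
    return NULL;
}

int start_service_on_current_cpu(pthread_t *tid)
{
    cpu_set_t mask;
    int cpu = sched_getcpu();            /* CPU the client runs on right now */

    if (cpu < 0 || pthread_create(tid, NULL, service_thread, NULL) != 0)
        return -1;

    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    /* Keep the service thread on the same CPU as its client. */
    return pthread_setaffinity_np(*tid, sizeof(mask), &mask);
}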

>If anything, a system where the device driver runs in the
>context of some other process would have worse cacheline
>contention and worse scalability because of the
>device driver accessing its own local data from any CPU
>that happened to run it.

What local data?

Last I saw, there's basically no local data except for
the data that needs to be shared anyway (ie, the queue),
and the data you end up having to work with (which is
coming from the CPU that happened to run it).

Linus