By: Brendan (btrotter.delete@this.gmail.com), May 15, 2006 4:00 am
Room: Moderated Discussions
Hi,
nick (anon@anon.com) on 5/15/06 wrote:
>Brendan (btrotter@gmail.com) on 5/14/06 wrote:
>>nick (anon@anon.com) on 5/14/06 wrote:
>
>>>Doesn't matter, if they have to both take the lock ->
>>>cacheline bouncing; if one CPU writes a messages which is
>>>read by another -> cacheline bouncing.
>>
>>That would involve a cache line fill. It's not even close to the performance effects
>>of lock contention (which either involves repeated cache
>
>No. I told you, you're 10 years out of date. Lock
>contention is largely not an issue any more, it is cache
>misses "cache line fill". Read some of literature for
>reasons why there is increasing focus on lock-free
>algorithms and data structures.
So you tell me what's worse: cacheline bouncing caused by a message queue lock and nothing else, or cacheline bouncing caused by a device driver's lock plus more cacheline bouncing caused by modifying the device driver's internal data?
Of course, in theory the device driver's internal data might be lock-free or wait-free, but in theory so could the message queue (and it'd probably be easier to develop one lock-free or wait-free algorithm for IPC than to develop one or more for every device driver).
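To make that concrete, here's a rough sketch of the sort of lock-free single-producer/single-consumer message queue I have in mind - an illustration of the technique only, not my actual IPC code, and the names and sizes are made up:

```c
/* Minimal sketch of a lock-free single-producer/single-consumer
 * message queue using C11 atomics -- an illustration of the idea,
 * not the IPC design discussed above. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_SLOTS 256          /* must be a power of two */

typedef struct {
    char payload[64];            /* one message = one cache line */
} message_t;

typedef struct {
    _Atomic size_t head;         /* advanced by the consumer only */
    _Atomic size_t tail;         /* advanced by the producer only */
    message_t slots[QUEUE_SLOTS];
} spsc_queue_t;

/* Producer side: returns false if the queue is full. */
static bool queue_send(spsc_queue_t *q, const message_t *msg)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);

    if (tail - head == QUEUE_SLOTS)
        return false;                        /* full */

    q->slots[tail & (QUEUE_SLOTS - 1)] = *msg;
    /* Release: the payload becomes visible before the new tail. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the queue is empty. */
static bool queue_receive(spsc_queue_t *q, message_t *msg)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);

    if (head == tail)
        return false;                        /* empty */

    *msg = q->slots[head & (QUEUE_SLOTS - 1)];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```

With one queue per sender/receiver pair, the only cross-CPU traffic is the cache line fill for the head/tail indices and for the message payload itself; there's no lock word for every CPU to spin on.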
>You're first assuming that there is a large amount of
>lock contention in scalable OSes then concluding
>that, therefore, the relatively minor cost of cacheline
>contention is unimportant.
I assume that lock contention should be avoided in scalable OSes, and that cacheline contention is unimportant until lock contention has been avoided. If minimizing lock contention weren't important, we wouldn't be discussing the problems of a "big kernel lock".
>>Still, I'm getting "side-tracked". My original comment was about the time taken
>>by the longest kernel function. Consider a system where a CPU spends 50% of it's
>>time in the kernel on average, compared to a system where a CPU spends 1% of it's
>>time in the kernel on average. For the first system "big kernel lock contention"
>>is extremely likely (even for a 2-way system), while for the second system "big
>>kernel lock contention" is unlikely (even for 8-way SMP).
>
>I'm not talking about lock contention, I'm talking about
>cacheline contention. But if all else was equal, of course
>you're right.
>
>However, your 1% system may need to enter the kernel much
>more *frequently*, in which case the cacheline contention
>might be as bad or worse.
I'd suggest that it doesn't matter whether that 1% is caused by entering the kernel ten times (at 0.1% per entry) or by entering the kernel twice (at 0.5% per entry) - it's still 1% regardless of how frequently you enter the kernel.
What you might be trying to say is that the CPU might actually spend 2% of its time in kernel space because the kernel is entered more often, in which case you'd be right (but I don't see how this makes a fundamental difference). In general, if the time spent inside a micro-kernel is equal to the time spent inside a monolithic kernel, then either the micro-kernel designer needs to be shot (or the monolithic kernel designer deserves several awards for outstanding achievements).
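As a back-of-envelope model (my own rough numbers, nothing measured): if each of n CPUs independently spends a fraction p of its time inside code protected by the big kernel lock, then a CPU entering the kernel finds the lock already held with probability of roughly 1 - (1 - p)^(n - 1), which is what I meant earlier by "extremely likely" at 50% and "unlikely" at 1%:

```c
/* Back-of-envelope model of big-kernel-lock contention -- my own
 * illustration, not a measurement. If each of n CPUs independently
 * spends a fraction p of its time under the big lock, a CPU entering
 * the kernel finds the lock held with probability ~ 1-(1-p)^(n-1). */
#include <math.h>
#include <stdio.h>

static double contention_chance(double p, int ncpus)
{
    return 1.0 - pow(1.0 - p, ncpus - 1);
}

int main(void)
{
    printf("50%% kernel time, 2-way: %4.1f%%\n", 100.0 * contention_chance(0.50, 2));
    printf(" 1%% kernel time, 8-way: %4.1f%%\n", 100.0 * contention_chance(0.01, 8));
    return 0;
}
```

That works out to about 50% for the first case and under 7% for the second.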
>>For monolithic systems the time spent in the kernel includes the time spent within
>>all device drivers, so the time spent inside the kernel is going to be greater than
>>an equivelent micro-kernel (where time spent in the kernel doesn't include the time
>>spent within device drivers). Therefore using a big kernel lock is going to effect
>>performance/scalability in a monolithic system much more than it would effect performance/scalability
>>in an equivelent micro-kernel.
>
>Yes. But nobody would call a monolithic big lock kernel
>scalable, so I don't think you can transitively conclude
>anything about the microkernel.
>
>>>In Linux, the control data does not need to go across
>>>node because it is basically kept in the stack of the
>>>running process. Instruction text is a different matter,
>>>but AFAIK, that can be replicated on the big NUMA machines
>>>in Linux.
>>
>>My last prototype used multiple copies of the kernel (one for each NUMA domain)
>>and my current prototype uses multiple copies of the kernel and multiple copies
>>of "static kernel data" (anything that doesn't change after boot). My kernels are
>>normally less than 64 KB though so the memory costs aren't too bad and it's still
>>sensible when the "across node penalty" isn't too high (e.g. 2 or 4 way Opterons).
>>Linux is typically around 3 MB so you might get better results using this memory for disk cache instead. :-)
>
>Well are you replecating the text of your servers?
No - they are independent processes that use CPU affinity to ensure that they always run on the same NUMA domain...
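For example (using the Linux sched_setaffinity() interface purely to illustrate the technique - my kernel has its own API, and the CPU numbers below are hypothetical):

```c
/* Sketch of pinning a process to the CPUs of one NUMA domain.
 * Uses the Linux sched_setaffinity() call only to illustrate the
 * technique; the CPU numbering is hypothetical. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t domain0;
    CPU_ZERO(&domain0);

    /* Assume CPUs 0-3 belong to NUMA domain 0 on this machine. */
    for (int cpu = 0; cpu < 4; cpu++)
        CPU_SET(cpu, &domain0);

    /* Pid 0 means "the calling process". */
    if (sched_setaffinity(0, sizeof(domain0), &domain0) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* From here on, the scheduler keeps this process on node 0's CPUs. */
    return 0;
}
```

Because each server process only ever runs within one domain, its working data tends to end up in that domain's memory (assuming memory is allocated locally), so there's no need to replicate the server's text the way the kernel's is replicated.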
>Considering that nobody in Linux even cares that much
>about it except the guys with 1024 CPU systems, I'm
>guessing it is completely unmeasurable on your kernel
>(outside microbenchmarks, maybe). :-)
Given that both AMD and Intel are increasing the number of cores rather than core frequency (and I predict Intel will shift to something like HyperTransport/NUMA in the near future), the number of people who care about it will probably have increased a lot by the time it matters to me.
My work consists of a series of prototypes, where each prototype builds on the last. The newest prototype uses a "modular micro-kernel", supports both 32-bit and 64-bit, and is designed to scale to large NUMA systems. I've basically reached the end of the series of prototypes (there's nothing left to add and the worst of the bottlenecks are gone). With some luck, my current prototype will become the basis for an OS. I expect it to take another 3 years before I have a bare working system running on legacy hardware, but it's too different to port applications (or drivers) to, and it'll probably take 10 years or more before it's actually usable. I knew this before I started, which is why I've spent so much time making sure the kernel design is "right".
Anyway, real-world benchmarks (like comparing web server and database performance) are a long way off...
>I'd wager that passing messages across interconnect would
>be more interesting...
Passing messages between NUMA domains will be slower, but the messaging is designed to work across a LAN, so I doubt NUMA domain boundaries are going to matter much in comparison.
Cheers,
Brendan