By: Brendan (btrotter.delete@this.gmail.com), May 16, 2006 5:37 am
Room: Moderated Discussions
Hi,
nick (anon@anon.com) on 5/16/06 wrote:
>Brendan (btrotter@gmail.com) on 5/16/06 wrote:
>>nick (anon@anon.com) on 5/15/06 wrote:
>
>>>Do you use threads of a single memory space running on
>>>different nodes?
>>
>>Yes, but I split user space into "process space" and "thread space", such that
>>thread space can't be accessed from other threads. It's a little like "thread local
>>data" in POSIX, only implemented so that separation is enforced. The disadvantage
>>is that switching between threads that belong to the same process involves changing
>>address spaces and is as expensive as switching between processes. The advantages
>>are that security of the thread's local data is enforced, a thread's data doesn't
>>suffer from cacheline bouncing (or "across NUMA node" access penalties if the process
>>itself isn't tied to a specific NUMA node), the linear
>
>So you migrate the page when the thread moves across CPUs?
Each process has a CPU affinity mask and each thread has a CPU affinity, where the process's CPU affinity mask is used to restrict the thread's CPU affinity (and the thread's CPU affinity determines which CPUs it can run on). Normally, when a process is created its CPU affinity mask is set to all CPUs within the least-loaded NUMA node, and each thread's CPU affinity is set to a subset of this. These CPU affinities can be changed, but normally aren't.
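As a rough sketch (these structures and names are made up for illustration - they aren't from my actual code), the set of CPUs the scheduler may actually run a thread on is just the intersection of the two masks:

#include <stdint.h>

/* Hypothetical structures - names are for illustration only. */
typedef uint32_t cpu_mask_t;        /* one bit per CPU */

struct process {
    cpu_mask_t affinity_mask;       /* set at creation to the CPUs of the least-loaded NUMA node */
};

struct thread {
    struct process *proc;
    cpu_mask_t affinity;            /* a subset of the owning process's mask */
};

/* The CPUs a thread may actually run on: its own affinity,
 * restricted by the owning process's affinity mask. */
static cpu_mask_t effective_affinity(const struct thread *t)
{
    return t->affinity & t->proc->affinity_mask;
}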
The code that determines which pages are "good candidates" for sending to swap needs to track page usage - I'm hoping to re-use this to determine which physical pages are worth replacing when they belong to the wrong NUMA domain for the thread's CPU affinity (for thread space) or the wrong NUMA domain for the process's CPU affinity mask (for process space).
This avoids replacing all pages when a thread or process is migrated, but means that performance will be worse immediately after migration and will improve with time. It also means that pages that are good candidates to send to swap won't be replaced (unless they actually are sent to swap and then loaded back in).
It also takes care of the problem caused when a thread/process needs to allocate more RAM but there are no free pages in a specific NUMA domain. In this case a "less than perfect" page would be allocated, and would be replaced later on when a better physical page becomes free.
It's something that will need a lot of benchmarking and tuning to get right though...
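To make that a little more concrete, here's the sort of check I have in mind (again, the page structure and helper names are invented for illustration, not real code). A page becomes a migration candidate when its NUMA domain contributes no CPUs to the owning affinity and the usage tracking - the same tracking the swap code needs anyway - says it's hot enough to be worth the copy:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t cpu_mask_t;

/* Hypothetical per-page bookkeeping, shared with the swap code. */
struct page {
    int      node;      /* NUMA domain the physical page currently lives in */
    unsigned usage;     /* usage counter already maintained for swap decisions */
};

/* Assumed helper: the CPUs belonging to a given NUMA node. */
extern cpu_mask_t node_cpu_mask(int node);

/* A page is worth replacing if its NUMA domain contributes no CPUs to
 * the affinity that owns it (thread affinity for thread space, the
 * process's mask for process space), and it is used often enough that
 * migrating beats leaving it remote. Cold pages are left alone - they
 * are swap candidates, not migration candidates. */
static bool worth_replacing(const struct page *pg, cpu_mask_t owner_affinity,
                            unsigned hot_threshold)
{
    if (node_cpu_mask(pg->node) & owner_affinity)
        return false;               /* already in a usable NUMA domain */
    return pg->usage >= hot_threshold;
}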
>You still have program text and the wider memory space
>bouncing though.
Yes.
>AFAIK there has been work in the past on Linux (and probably
>other OSes) to do things like NUMA page migration, and even
>pagecache NUMA replication (for unmapped or readonly mapped
>pages). Not sure the state of these.
You may mean this patch: http://lwn.net/Articles/179934/
>>>Wrong. Number of cores has nothing to do with it, and
>>>desktops/workstations/small servers will never care much
>>>about NUMA issues because there just aren't enough sockets
>>>to make a difference. Improvement on even an 8 socket
>>>Opteron is probably unmeasurable on Linux, for example.
>>
>>For Opteron one hop is about 25% slower and 2 hops is about 50% slower. I couldn't
>>find figures for 3 hops (which is necessary for 8 sockets when there are only 3 HyperTransport
>>links and something needs to connect to an I/O hub), and the figures I did find
>>vary a fair bit between different sources.
>
>I didn't mean memory latency, obviously that is easily
>measurable and relevant to real workloads. I was talking
>about kernel text replication. icache is mostly very well
>behaved (readonly, good locality, high frequency of use)
>and pretty easy to prefetch.
Does the kernel have a ".data" section?
>Well good luck with it. Sounds fun -- post a link if/when
>you feel it is a bit more interesting.
Thanks - don't worry, when I feel it's interesting I'll be plastering links everywhere I go... :-)
Cheers,
Brendan