By: David Kanter (dkanter.delete@this.realworldtech.com), August 28, 2007 12:43 pm
Room: Moderated Discussions
Paul (no@thanks.com) on 8/28/07 wrote:
---------------------------
>David Kanter (dkanter@realworldtech.com) on 8/28/07 wrote:
>---------------------------
>>I hope you all enjoy the read. I'd also like to thank everyone who helped with
>>this article. I relied on the technical expertise of quite a few friends, and
>>without their help this article wouldn't be nearly as complete or understandable.
>
>Regarding Xeon MPs and cache size, you usually do need more cache the more processors
>you put on an SMP node. Any thread may be scheduled on any processor, so each processor
>has to have the working sets for all the threads in cache unless you want to wait for a fill from memory.
I agree.
>I don't know how Opteron gets away with its L2 caches being small. Some of it is
>probably declaring that it's a NUMA system to the OS so it takes CPU affinity into
>account when making scheduling decisions.
Low latency to local memory is probably the answer, although some cache-happy benchmarks like SPECjbb really do suffer.
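For anyone who wants to try the affinity angle by hand, here is a minimal sketch of pinning a process to one CPU so its working set stays in that CPU's cache and its memory tends to stay local. It assumes Linux, where Python exposes os.sched_getaffinity and os.sched_setaffinity; the helper name run_pinned is made up for illustration, and this is explicit pinning rather than what a NUMA-aware scheduler does on its own.

```python
import os

def run_pinned(work, cpu=None):
    """Run work() with this process restricted to a single CPU, then restore.

    Hypothetical helper for illustration. If no CPU is given, the lowest
    numbered CPU in the current affinity mask is used.
    """
    if not hasattr(os, "sched_setaffinity"):
        # No affinity API on this platform (e.g. macOS, Windows): just run.
        return work()
    original = os.sched_getaffinity(0)       # 0 means the calling process
    target = min(original) if cpu is None else cpu
    os.sched_setaffinity(0, {target})        # restrict to one CPU
    try:
        return work()
    finally:
        os.sched_setaffinity(0, original)    # always restore the old mask
```

Something like run_pinned(lambda: sum(range(10**6))) would run the loop on a single CPU and put the original mask back afterwards.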
>Does Opteron do anything like have the memory requests for a line fill from a remote
>processor be serviced from within the L2 cache so you can essentially shuttle working
>sets between L2 caches and not have to go to memory?
It depends on what state the line is in. In general, I believe the Opteron coherency protocol requires that the requesting processor receive all responses (from caching agents and the home node) before using data.
This may not be the case if the data is in a dirty cache line, but I'm not 100% sure, since AMD's protocol is not particularly open to the public.
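To make the "wait for all responses" rule concrete, here is a toy model of just that one rule as I described it above. It is not AMD's actual protocol, which isn't public; the class name, agent names, and message shape are all made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class LineRequest:
    """Toy model of one outstanding cache line request (hypothetical)."""
    expected: set                        # every caching agent plus the home node
    received: set = field(default_factory=set)
    data: object = None

    def on_response(self, agent, data=None):
        """Record a response; only some responses carry the line itself."""
        if agent not in self.expected:
            raise ValueError(f"unexpected responder: {agent}")
        self.received.add(agent)
        if data is not None:
            self.data = data

    def usable(self):
        """Data may be used only once every expected agent has answered."""
        return self.data is not None and self.received == self.expected


req = LineRequest(expected={"cpu1", "cpu2", "home"})
req.on_response("home", data=b"line")    # home node returns the line first
early = req.usable()                     # still waiting on cpu1 and cpu2
req.on_response("cpu1")
req.on_response("cpu2")
late = req.usable()                      # all responses are in
```

In this model the requester holds the data from the home node but cannot use it until the last probe response arrives, which is why the slowest responder sets the latency.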
DK