By: David Kanter (dkanter.delete@this.realworldtech.com), September 16, 2007 4:34 pm
Room: Moderated Discussions
Peter Gerdes (truepath@infiniteinjury.org) on 9/16/07 wrote:
---------------------------
>Hey, thanks for your patience. I end up using some caps below for emphasis; they aren't meant to express frustration.
Current K8 systems look like this:
0 - 1
| |
2 - 3
Where 0-3 are processors with attached memory. Future systems from both Intel and AMD will look like:
0 - 1
| x |
2 - 3
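To put a rough number on the difference between the two diagrams, here's a toy hop-count calculation (purely illustrative; the adjacency tables are just the pictures above, not anyone's actual routing):

/* Worst-case link hops for the two 4-socket topologies drawn above. */
#include <stdio.h>

#define N 4

/* 1 = direct HT/CSI link between sockets i and j */
static const int square[N][N] = {   /* K8-style square, no diagonals */
    {0,1,1,0}, {1,0,0,1}, {1,0,0,1}, {0,1,1,0}
};
static const int full[N][N] = {     /* fully connected, diagonals added */
    {0,1,1,1}, {1,0,1,1}, {1,1,0,1}, {1,1,1,0}
};

static int worst_hops(const int adj[N][N])
{
    /* Floyd-Warshall on a 4-node graph; returns the longest shortest path. */
    int d[N][N], i, j, k, worst = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            d[i][j] = (i == j) ? 0 : (adj[i][j] ? 1 : N); /* N ~ "infinity" */
    for (k = 0; k < N; k++)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                if (d[i][k] + d[k][j] < d[i][j]) d[i][j] = d[i][k] + d[k][j];
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            if (d[i][j] > worst) worst = d[i][j];
    return worst;
}

int main(void)
{
    printf("square (K8-style): worst case %d hops\n", worst_hops(square));
    printf("fully connected:   worst case %d hops\n", worst_hops(full));
    return 0;
}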
>Now the question being asked is: when does the MOESI cache coherency protocol offer an advantage over a MESIF protocol?
>
>Well the benefit of the O state is that if chip A has a modified cache line (call
>it L) and both chips B and C request to read that cache line, then chip A can transition
>L to the state O and pass it off to B and C without doing a write. My claim is
>that in the vast majority of cases the MESIF protocol can do exactly (or almost exactly) the same thing.
OK, so the conditions are actually weaker than this. Imagine that A has a modified line, and then B requests to read it. In MESIF, this could be resolved in two ways:
Solution 1:
Evict cache line from A and send to B as modified
Solution 2:
Write back to memory
Send line to B as shared, and switch line in A to shared
Note that solution 1 doesn't work if several caches want to read the line.
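Written out as a toy transition helper (this just encodes the two resolutions above; it's not meant to describe any real implementation):

/* B issues a read for a line that A currently holds in MODIFIED. */
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED, FORWARD } mesif_t;

typedef struct {
    mesif_t a_state;    /* A's copy after the transaction      */
    mesif_t b_state;    /* B's copy after the transaction      */
    int     writebacks; /* memory writes this resolution costs */
} outcome_t;

static outcome_t resolve_read(int solution)
{
    outcome_t o;
    if (solution == 1) {
        /* Solution 1: hand the dirty line to B wholesale.
           Only works while a single other cache wants it. */
        o.a_state = INVALID;
        o.b_state = MODIFIED;
        o.writebacks = 0;
    } else {
        /* Solution 2: write back first, then both hold a clean copy
           (one of them could hold F instead of S; that detail doesn't
           change the write-back count). */
        o.a_state = SHARED;
        o.b_state = SHARED;
        o.writebacks = 1;
    }
    return o;
}

int main(void)
{
    printf("solution 1: %d write backs\n", resolve_read(1).writebacks);
    printf("solution 2: %d write backs\n", resolve_read(2).writebacks);
    return 0;
}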
>Why? So suppose that cache line L (in A's cache) corresponds to a memory location
>in bank A, and that, as above, B and C both request to read that location. The MESIF
>protocol requires that A first write its line out to memory, transitioning to
>the F state, before passing it on to B and C. But SINCE THE MEMORY CONTROLLER THAT
>WOULD WRITE L TO MEMORY IS INTEGRATED INTO CHIP A NO ACTUAL WRITE HAS TO TAKE PLACE.
Actually it does. You can't just pretend not to write stuff to memory, or you'll be totally screwed if you get an uncorrectable bit flip in your cache.
>In other words A just *immediately* tells the other chips that it has written L
>to memory and hands out the cache line to B and C, making an *internal* mark to write
>L to memory before eliminating it from its own cache. Correctness is guaranteed
>because the only way any other chip can read or write to the memory backing L is through chip A.
But now you have to pin a cache line into memory (which x86s can't do). And what happens if that line gets evicted? Then it has to be written back to memory. No matter what, you are creating a lot of complexity for yourself this way.
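To make the bookkeeping concrete, here's a rough sketch of the extra state the scheme would need; the flag and field names are made up, and the point is that the write back is only deferred, never removed:

#include <stdbool.h>
#include <stdint.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED, FORWARD } mesif_t;

typedef struct {
    uint8_t data[64];
    mesif_t state;
    bool    deferred_writeback; /* "supposed to be in memory" but isn't yet */
} cache_line_t;

/* Eviction path: a line announced as written back still has to be flushed. */
static void evict(cache_line_t *line)
{
    if (line->state == MODIFIED || line->deferred_writeback) {
        /* ...push line->data through the local memory controller... */
    }
    line->state = INVALID;
    line->deferred_writeback = false;
}

int main(void)
{
    cache_line_t line = { {0}, MODIFIED, true };
    evict(&line);  /* the write back happens here, just later than in MESIF */
    return 0;
}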
>So what if L isn't backed by memory A controls? Let's suppose that L is backed
>by memory that B controls. Well in this case when B requests the cache line A already
>has to transfer that cache line over to B. Therefore (maybe 1) extra message (depending
>on how reads with C work) will be required and no extra transfers of a cache line.
>There is still no requirement to wait on memory since B can now pull the same trick I described for A above.
>It seems that the only time that the O state will make a significant difference
>is when we have a modified cache line L in chip C's cache backed by memory controlled
>by A requested by chip D. In other words O only makes a real difference when both
>the chip holding the modified cache line and the requesting chip don't control the backing memory.
>
>Hopefully this was a bit more clear but I will answer what you said below as well.
I see what you're saying, but it sounds fairly complex and much more work than it's worth.
[snip]
>We were talking about a situation where each chip has an inbuilt memory controller.
>I'm just saying that the only way that any other chips (B,C,D) in the system know
>about the state of a memory bank A controlled by the integrated memory controller
>on chip A is through the HT/CSI links between chip A and B,C,D. In terms of my
>diagram above I merely mean that B,C, D can't listen in on the double line between chip A and memory bank A.
Ah, ok.
>>>Now when the cache coherency protocol says that a cache line must be written back
>>>to memory, it doesn't actually care if the line is 'really' stored in the actual
>>>memory bank, only that it APPEARS to be so stored, i.e., the memory controller could implement its own cache.
>>
>>That's not a cache, it's a buffer. But sure, you could buffer the writes - you
>>just need to make sure that if you lose power you don't have any problems.
>
>It's an integrated memory controller so if the chip loses power we probably lost
>the information in memory anyway (is this not true in some configurations?).
Fair enough. My point is that you need to be able to handle any conceivable corner case (uncorrectable error in a cache line, read or write shootdowns, etc.).
>Also it's a cache if it serves future read requests out of it and lets write requests
>change items that haven't yet been written out to real memory but this is irrelevant.
>>>Thus presumably a chip that needs to 'write' a cache line to memory it controls
>>>doesn't need to send any messages or do anything but remember that this cache line
>>>has been 'written' to memory.
>>
>>Where do you want to store that information? In the memory controller, in the chip, etc.?
So you would have to add yet another status bit, a 'supposed to be written to memory' bit, distinct from the dirty bit?
>IN THE CACHE LINE!! The original question was: Does having an OWNED state in the
>cache coherency protocol make a noticeable performance difference? My claim is
>no, because even in a MESIF protocol chips with integrated memory controllers
>can duplicate the effect the vast majority of the time.
>
>In other words WHEN A CHIP USING THE MESIF PROTOCOL CONTROLS THE MEMORY BACKING
>A CACHE LINE IT CAN ACT AS IF IT HAD THE MEMORY IN AN O STATE.
Here's the catch though - the memory controller controls the memory. It doesn't control the cache, and you don't want to introduce any dependencies between the two.
>>>So long as every read request by another chip on
>>>that memory location reflects the modified value, everything is hunky dory.
>>
>>Sure. The problem is not the common case though, it's probably in handling exceptional cases.
>
>This is just a premise to indicate what I'm saying next.
>
>
>>>Thus
>>>since MOST logical writes to memory that the O state would eliminate don't require
>>>any PHYSICAL writes to memory, it doesn't do much for efficiency.
>>
>>Um, so write back buffers have to write to memory eventually. You don't eliminate
>>the write, you just defer it in your system.
>
>YES! DOING EXACTLY THE SAME THING ALLOWING THE O STATE WOULD.
No, you don't get it. The O state actually ELIMINATES THE WRITE. Let me give you a concrete example:
CPU0 writes a cache line
CPU1,2 ask for a shared copy; CPU0 has it in O, the others in S
CPU1,2 read the cache line
CPU0 writes again and invalidates CPU1,2, leaving the cache line in M state
CPU1,2 ask for a shared copy; CPU0 has it in O, the others in S
Repeat...
Now I don't know how often this happens, but this sequence only requires a single write back at the end. Under the MESIF system, you'd actually have to write it back for every iteration. So you actually could save quite a few writes with an O state.
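As a toy tally of that sequence (the round count is made up, nothing measured):

/* Write backs under MOESI vs. a write-back-on-share protocol like MESIF
   for the write/read ping-pong described above. Illustrative only. */
#include <stdio.h>

int main(void)
{
    const int rounds = 10;        /* hypothetical number of write/read rounds */
    int moesi_wb = 0, mesif_wb = 0;

    for (int i = 0; i < rounds; i++) {
        /* CPU0 writes (M), CPU1/2 ask for shared copies.                 */
        /* MOESI: CPU0 drops to O and supplies the data; no memory write. */
        /* MESIF: the dirty line has to be written back before sharing.   */
        mesif_wb += 1;
    }
    moesi_wb += 1;  /* MOESI's single write back, when the O line is finally evicted */

    printf("over %d rounds: MOESI %d write backs, MESIF %d write backs\n",
           rounds, moesi_wb, mesif_wb);
    return 0;
}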
DK