By: Aaron Spink (aaronspink.delete@this.notearthlink.net), December 4, 2014 12:01 pm
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on December 4, 2014 9:11 am wrote:
> An example: Multiple threads incrementing a value to get a unique identifier would be serialized with communication
> delay in a traditional implementation (in addition to having more coherence traffic, which could be a performance
> constraint). With a centralized arbiter the communication delay is one round trip between core and arbiter
> rather than multiple serialized trips between different requesters and previous owners.
>
> (Perhaps I am missing something.)
>
> (A perhaps better interface would allow each requester to ask for the response to be put in a
> mailbox since the result is probably not an immediate data dependency. I suppose such could be
> kludged in to existing systems by adding an undefined state for blocks of memory, so the requester
> would undefine a block, ask the arbiter/controller to write the response to that block, and hardware
> could treat a read of an undefined block something like a MWAIT instruction.)
>
There are a couple of ways to deal with this. Which is best will generally depend on the coherency protocol. Assuming some sort of directory/home node protocol, you can do hot-line detection (either heuristically or explicitly) and export the request to the home node. So, for instance, for a CMPXCHG-type op you would ship the CMPXCHG directly to the home node, which would have the line in its hot cache. The home node would perform the CMPXCHG and send back either an ACK or a NACK. FETCH_ADD would work similarly: you ship the operation to the home node and it performs the add there.
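To make the unique-identifier example from the quoted post concrete, here is a minimal sketch in C11 atomics. The fetch-and-add below is exactly the operation a hot-line-detecting implementation could ship to the home node, returning only the old value instead of bouncing the line between requesters (the function name is just for illustration):

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t next_id = 0;

uint64_t get_unique_id(void)
{
    /* One RMW per caller; a FETCH_ADD-to-home-node implementation would
     * execute this at the directory instead of in the local cache. */
    return atomic_fetch_add_explicit(&next_id, 1, memory_order_relaxed);
}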
In most cases, especially with hot lines, this will reduce latency, congestion, and bandwidth use on the interconnect. Realistically, you would also want a LOAD_UNCACHED instruction for any reads of the line so it never becomes cache resident. But even in the cache-resident case you are generally going to come out ahead, especially under high load, since the line is simply invalidated rather than constantly shipped around.
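As a purely illustrative sketch of how these might surface to software: the helper names below (remote_cmpxchg_u64, load_uncached_u64) are made up, and since no portable LOAD_UNCACHED or shipped-CMPXCHG exists, the bodies fall back to ordinary GCC/Clang atomics. On hardware with the instructions described above they would instead map to a single op executed at the line's home node:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: on real hardware this would be performed at the home node,
 * which returns ACK/NACK without migrating the line. Fallback: plain CAS. */
static inline bool remote_cmpxchg_u64(uint64_t *addr, uint64_t expect, uint64_t desired)
{
    return __atomic_compare_exchange_n(addr, &expect, desired,
                                       false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

/* Hypothetical: on real hardware this would read the value without ever
 * making the line cache resident on this core. Fallback: plain atomic load. */
static inline uint64_t load_uncached_u64(const uint64_t *addr)
{
    return __atomic_load_n(addr, __ATOMIC_SEQ_CST);
}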
This actually brings me to an architecture area that is of interest to me: pretty much all architectures are still completely cache unaware. Most of our coherency primitives and their functions are designed with willful ignorance of the concept of caches. Basically all load/store ops are still designed around a model where everything is directly connected to a shared bus without caches. Given the modern reality, and the likelihood of it persisting going forward, there is a lot of optimization space available in making the whole load/store ISA space much more cache aware, going directly back into the actual programs themselves.
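For contrast, about the only cache awareness software gets today is hints bolted on after the fact. A short x86/GCC sketch of what exists now, versus the first-class cache-aware load/store ops argued for above:

#include <emmintrin.h>   /* SSE2 non-temporal store, sfence */
#include <stddef.h>

void fill_once(int *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Non-temporal store hint: write toward memory without allocating
         * the line in the cache hierarchy. */
        _mm_stream_si32(&dst[i], 0);
    }
    _mm_sfence();        /* make the streaming stores globally visible */
}

long walk(const int *src, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Software prefetch hint: read access, low temporal locality. */
        __builtin_prefetch(&src[i + 16], 0, 0);
        sum += src[i];
    }
    return sum;
}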