By: Michael S (already5chosen.delete@this.yahoo.com), December 6, 2014 3:48 pm
Room: Moderated Discussions
Aaron Spink (aaronspink.delete@this.notearthlink.net) on December 4, 2014 12:01 pm wrote:
> Paul A. Clayton (paaronclayton.delete@this.gmail.com) on December 4, 2014 9:11 am wrote:
>
> > An example: Multiple threads incrementing a value to get a
> > unique identifier would be serialized with communication
> > delay in a traditional implementation (in addition to having
> > more coherence traffic, which could be a performance
> > constraint). With a centralized arbiter the communication delay is one round trip between core and arbiter
> > rather than multiple serialized trips between different requesters and previous owners.
> >
> > (Perhaps I am missing something.)
> >
> > (A perhaps better interface would allow each requester to ask for the response to be put in a
> > mailbox, since the result is probably not an immediate data dependency. I suppose such could be
> > kludged into existing systems by adding an undefined state for blocks of memory, so the requester
> > would undefine a block, ask the arbiter/controller to write the response to that block, and hardware
> > could treat a read of an undefined block something like an MWAIT instruction.)
> >
>
> There are a couple of ways to deal with this. What's best generally will depend on what the coherency
> protocol is. Assuming some sort of directory/home node protocol, you can do hot line detection (either
> heuristically or explicitly) and export that request to the home node. So for instance, for a CMPXCHG
> type op, you would directly export the CMPXCHG to the home node which would have the line in its hot
> cache. The home node would do the CMPXCHG op, and send back either an ACK or NACK. FETCH_ADD would
> work similarly where you ship the operation to the home node and it does it.
>
Reusing my previous answer to your previous post with the same idea:
You assume a particular use case, specifically one where the next access to the memory location in question is likely to come either from a different core or a long time later.
However, if the next access is more likely to come from the same core, and soon, then your performance and power win suddenly becomes a performance and power loss.
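
To make the trade-off concrete, here is the counter pattern Paul described, written out with C++11 atomics. A minimal sketch of the pattern, nothing more:

#include <atomic>
#include <cstdint>

// Shared counter; every thread that needs a unique identifier
// performs an atomic fetch-and-add on it.
std::atomic<uint64_t> next_id{0};

uint64_t get_unique_id() {
    // On a conventional coherent system, each call from a new core
    // must first pull the cache line into that core's cache in
    // exclusive state, so unrelated cores serialize on line ownership.
    return next_id.fetch_add(1, std::memory_order_relaxed);
}

But if one core calls get_unique_id() in a tight loop, the line stays in its L1 and every increment is nearly free. Export every increment to the home node instead and each of those cheap local operations becomes a full round trip over the interconnect. That is exactly the case where your scheme loses.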
> In most cases, esp with hot lines, this will reduce latency, congestion, and
> bandwidth on the interconnect. Realistically, you would also want a LOAD_UNCACHED
> instruction for any reading of the line so it never becomes cache resident. But even in
> the cache resident case you are generally going to come out ahead, esp in high-load cases,
> as the line will simply be invalidated rather than constantly shipped around.
>
> This actually brings me to an architecture area that is of interest to me: Pretty much all architectures
> are still completely cache unaware. Most of our coherency primitives and their functions are designed with
> willful ignorance of the concept of caches. Basically all load/store ops are still designed around a model
> where everything is directly connected to a shared bus without caches. Given the modern reality and the
> likelihood of it persisting, there is a lot of optimization space available in making the whole load/store
> ISA space much more cache aware, reaching all the way back into the actual programs themselves.
>
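That said, to show what such exported operations could look like from software, here is a sketch. remote_fetch_add and load_uncached are made-up names standing in for the FETCH_ADD-shipped-to-home-node and LOAD_UNCACHED operations you describe; the bodies are ordinary C++11 atomics, there only so the sketch compiles:

#include <atomic>
#include <cstdint>

uint64_t remote_fetch_add(std::atomic<uint64_t>& counter, uint64_t v) {
    // Fallback body so the sketch compiles; the hardware version would
    // execute the add at the line's home node without ever moving the
    // line into the requester's cache.
    return counter.fetch_add(v, std::memory_order_relaxed);
}

uint64_t load_uncached(const std::atomic<uint64_t>& counter) {
    // Fallback body; the hardware version would return the data without
    // making the line cache resident in the requesting core.
    return counter.load(std::memory_order_relaxed);
}

std::atomic<uint64_t> next_id{0}; // lives in its home node's hot cache

uint64_t get_unique_id_remote() {
    // One request/response round trip to the home node per increment,
    // no matter which core incremented last; the line never ping-pongs.
    return remote_fetch_add(next_id, 1);
}

On hardware of the kind you describe, remote_fetch_add would be one request/response to the home node with no line transfer at all, and load_uncached would let a poller read the counter without stealing ownership from the writers. Whether that is a win still depends on the access pattern, per my point above.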