By: Konrad Schwarz (no.spam.delete@this.no.spam), December 4, 2014 1:10 pm
Room: Moderated Discussions
Aaron Spink (aaronspink.delete@this.notearthlink.net) on December 4, 2014 12:01 pm wrote:
> coherency
> protocol is. Assuming some sort of directory/home node protocol, you can do hot line detection (either
> heuristically or explicitly) and export that request to the home node. So for instance, for a CMPXCHG
> type op, you would directly export the CMPXCHG to the home node which would have the line in its hot
> cache. The home node would do the CMPXCHG op, and send back either an ACK or NACK. FETCH_ADD would
> work similarly where you ship the operation to the home node and it does it.
Note that the reply will also need to carry the current or previous value, not just an ACK or NACK: FETCH_ADD returns the prior value on success, and CMPXCHG must report the observed value on failure.
I'm no expert in VLSI design, but this seems like a large increase in complexity in the coherency
protocol for a modest benefit. I find it more likely that the line in question is protected from eviction from the local cache for a short period of time. My understanding is that this is
basically how modern x86 implements atomic operations on cacheable memory.
And honestly, I still think most programmers would be better served by using lightweight (in the
uncontended case) synchronization objects provided by the OS (e.g., mutexes and condition variables)
rather than employing atomic operations directly.
The code bases and programmer mindset that rely on atomic operations stem from
a time, and from operating systems, where such lightweight primitives were not available.
> This actually brings me to an architecture area that is of interest to me: Pretty much all architectures
> are still completely cache unaware. Most of out coherency primitives and their functions are designed with
> willful ignorance of the concept of caches. Basically all load store ops are still designed around a model
> where everything is directly connected to a shared bus without caches. Given the modern reality and the
> likelihood of it going forward, there is a lot of optimization space available around making the whole load/store
> ISA space much more cache aware going directly back into the actual
I think the Cell SPE showed that more complex memory models are unpalatable to programmers.