By: anon (anon.delete@this.anon.com), July 21, 2015 6:22 am
Room: Moderated Discussions
Konrad Schwarz (konrad.schwarz.delete@this.siemens.com) on July 21, 2015 12:08 am wrote:
> anon (anon.delete@this.anon.com) on July 20, 2015 7:29 am wrote:
> > Konrad Schwarz (konrad.schwarz.delete@this.siemens.com) on July 20, 2015 4:44 am wrote:
> > > Except that barrier operations are -- at least by default -- global: the store queues of all
> > > coherent CPUs are drained when a (global) barrier instruction is executed (by one CPU).
> >
> > Which CPUs and which barrier instructions might those be?
> >
>
> I know of Power(PC) and ARM.
I don't believe that is the case for Power. I'm not sure about ARM; I don't know it as well.
From the Power ISA 2.07 manual, Book II 1.7:
When a processor (P1) executes a Synchronize, eieio, or mbar instruction a memory barrier is created, which orders applicable storage accesses pairwise, as follows. Let A be a set of storage accesses that includes all storage accesses associated with instructions preceding the barrier-creating instruction, and let B be a set of storage accesses that includes all storage accesses associated with instructions following the barrier-creating instruction. For each applicable pair a_i, b_j of storage accesses such that a_i is in A and b_j is in B, the memory barrier ensures that a_i will be performed with respect to any processor or mechanism, to the extent required by the associated Memory Coherence Required attributes, before b_j is performed with respect to that processor or mechanism. The ordering done by a memory barrier is said to be "cumulative" if it also orders storage accesses that are performed by processors and mechanisms other than P1, as follows.
- A includes all applicable storage accesses by any such processor or mechanism that have been performed with respect to P1 before the memory barrier is created.
- B includes all applicable storage accesses by any such processor or mechanism that are performed after a Load instruction executed by that processor or mechanism has returned the value stored by a store that is in B.
The manual always speaks of storage accesses being performed *with respect to* the processor that executed the barrier. The extension to accesses by processors other than P1 is, I believe, specifying causality (notice the first point says performed *with respect to P1* before the barrier is created).
I can't find anything that would require an implementation to flush remote store queues in response to barriers (particularly not lwsync, which orders accesses to cacheable memory). Even for MMIO/caching-inhibited memory, the manual suggests you can't rely on a barrier to affect remote CPUs. E.g., in Book II, 1.6, with respect to caching-inhibited storage:
None of the memory barrier instructions prevent the combining of accesses from different processors.