By: anon (anon.delete@this.anon.com), July 21, 2015 7:17 am
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on July 21, 2015 6:22 am wrote:
> Konrad Schwarz (konrad.schwarz.delete@this.siemens.com) on July 21, 2015 12:08 am wrote:
> > anon (anon.delete@this.anon.com) on July 20, 2015 7:29 am wrote:
> > > Konrad Schwarz (konrad.schwarz.delete@this.siemens.com) on July 20, 2015 4:44 am wrote:
> > > > Except that barrier operations are -- at least by default -- global: the store queues of all
> > > > coherent CPUs are drained when a (global) barrier instruction is executed (by one CPU).
> > >
> > > Which CPUs and which barrier instructions might those be?
> > >
> >
> > I know of Power(PC) and ARM.
>
> I don't believe that is the case for Power. Not sure about ARM; I don't know as much about it.
>
> From the Power ISA 2.07 manual, Book II 1.7:
>
>
> When a processor (P1) executes a Synchronize, eieio, or mbar instruction a memory barrier is created, which
> orders applicable storage accesses pairwise, as follows. Let A be a set of storage accesses that includes
> all storage accesses associated with instructions preceding the barrier-creating instruction, and let B be
> a set of storage accesses that includes all storage accesses associated with instructions following the
> barrier-creating instruction. For each applicable pair a i ,b j of storage accesses such that a i is in A and
> b j is in B, the memory barrier ensures that a i will be performed with respect to any processor or mechanism,
> to the extent required by the associated Memory Coherence Required attributes, before b j is performed with respect
> to that processor or mechanism. The ordering done by a memory barrier is said to be "cumulative" if it also orders
> storage accesses that are performed by processors and mechanisms other than P1, as follows.
>
> - A includes all applicable storage accesses by any such processor or mechanism that
> have been performed with respect to P1 before the memory barrier is created.
>
> - B includes all applicable storage accesses by any such processor or mechanism that are performed after a Load
> instruction executed by that processor or mechanism has returned the value stored by a store that is in B.
>
>
> It always talks about storage accesses *with respect to* the processor that executed the barrier.
> The extension to accesses by processors other than P1 is, I believe, specifying causality (notice
> the first point says performed *with respect to P1* before the barrier).
>
> I can't find anything that would require an implementation to flush remote store queues
> in response to barriers (particularly not lwsync, which orders accesses to cacheable memory),
> but even the text on MMIO/caching-inhibited memory suggests you can't rely on a barrier to affect
> remote CPUs. E.g., in Book II, 1.6, with respect to caching-inhibited storage:
>
>
> None of the memory barrier instructions prevent the combining of accesses from different processors.
>
>
I got the ARMv8-A reference manual. It seems to be similar to the Power ISA.
"[For DMB instruction] If the required shareability is Full system then the operation applies to all observers within the system."
Now that may sound like the barrier operation should effectively be executed by all CPUs, but I don't believe that is the case. It's just specifying the set of *observers* for which the ordering of the memory accesses is guaranteed.
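As an aside (my own listing, taken from the A64 instruction set rather than from this passage), the "required shareability" and "required access types" are simply operands of the DMB instruction:

/* Sketch: the DMB operand selects the shareability domain and the
 * access types to be ordered; these are standard A64 barrier options,
 * written here as GCC/Clang inline asm. */
static inline void dmb_variants(void)
{
    __asm__ __volatile__("dmb oshst" ::: "memory"); /* outer shareable, stores only */
    __asm__ __volatile__("dmb ishld" ::: "memory"); /* inner shareable, loads ordered against later accesses */
    __asm__ __volatile__("dmb ish"   ::: "memory"); /* inner shareable, all accesses */
    __asm__ __volatile__("dmb sy"    ::: "memory"); /* full system: the "Full system" case quoted above */
}

The manual then defines the two ordered groups: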
A DMB creates two groups of memory accesses, Group A and Group B:
Group A Contains:
- All explicit memory accesses of the required access types from observers in the same required shareability domain as PEe that are observed by PEe before the DMB instruction. These accesses include any accesses of the required access types performed by PEe.
- All loads of required access types from an observer PEx in the same required shareability domain as PEe that have been observed by any given different observer, PEy, in the same required shareability domain as PEe before PEy has performed a memory access that is a member of Group A.
Group B Contains:
- All explicit memory accesses of the required access types by PEe that occur in program order after the DMB instruction.
- All explicit memory accesses of the required access types by any given observer PEx in the same required shareability domain as PEe that can only occur after a load by PEx has returned the result of a store that is a member of Group B.
Any observer with the same required shareability domain as PEe observes all members of Group A before it observes any member of Group B to the extent that those group members are required to be observed, as determined by the shareability and cacheability of the memory locations accessed by the group members.
This does the same thing as the Power ISA as far as I can see: it establishes causal ordering, but no further requirement. In fact it clearly shows that Group A cannot contain stores of any CPU other than the one executing the barrier unless those stores have already been observed by the barrier-executing CPU, and Group B only contains operations of another CPU that occur after it has loaded a value produced by a store already in Group B. If the other CPU is concurrently performing only stores that nobody has yet observed, they will be in neither A nor B.
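To make that concrete, here is a minimal sketch of the classic WRC-style litmus test that the cumulativity rules are about (my own construction, not from the manual; it assumes an AArch64 machine with GCC/Clang inline asm, and uses volatile rather than C11 atomics purely to keep the generated accesses plain):

/* WRC+dmb+dmb: if P1 observes P0's store to x before its DMB, that store
 * joins Group A of P1's barrier, so any observer that sees P1's store to y
 * (Group B) must also see x == 1. Forbidden outcome: r1==1, r2==1, r3==0. */
#include <pthread.h>
#include <stdio.h>

static volatile int x, y;
static int r1, r2, r3;

#define DMB_ISH() __asm__ __volatile__("dmb ish" ::: "memory")

static void *p0(void *a) { x = 1; return NULL; }   /* plain store, no barrier */

static void *p1(void *a)
{
    r1 = x;        /* may or may not observe P0's store */
    DMB_ISH();     /* if r1 == 1, P0's store is now in Group A */
    y = 1;         /* Group B */
    return NULL;
}

static void *p2(void *a)
{
    r2 = y;
    DMB_ISH();     /* keeps P2's own loads in order */
    r3 = x;
    return NULL;
}

int main(void)
{
    pthread_t t0, t1, t2;
    pthread_create(&t0, NULL, p0, NULL);
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("r1=%d r2=%d r3=%d\n", r1, r2, r3);
    return 0;
}

A single run like this will almost never hit the interesting interleaving (real litmus testing loops millions of iterations); the point is only that nothing in the quoted rules requires P1's DMB to reach into P0's store queue. If P0's store has not yet been observed by P1, it simply is not in Group A, and the barrier says nothing about it.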
Perhaps your claim is weaker: barriers *may* be required to drain store queues on remote CPUs, depending on what the remote CPU is currently doing (in which case the barrier would still have to be broadcast to, and checked by, those remote CPUs).
Can you provide an example of how the reordering of a remote CPU must be constrained in the case of a barrier? I can't see where the need would be, and we know that Intel's x86 CPUs implement extremely strong ordering (for locked instructions, global sequential ordering) in a handful of cycles, which would be impossible if they had to round-trip off-core, let alone off-die. That seems to be a counterexample showing that such ordering can be implemented without having to potentially drain remote store queues. So what stronger ordering would require it?
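For reference, a sketch of the kind of operation I mean (my own example; the counter name is made up): a sequentially consistent read-modify-write in C11 compiles to a single LOCK-prefixed instruction on x86-64, and that instruction is a full barrier by itself:

/* Sketch: GCC/Clang compile this seq_cst RMW to "lock xadd" on x86-64.
 * The LOCK prefix makes the operation globally ordered and acts as a
 * full fence for this core; the usual implementation drains only the
 * local store buffer and relies on cache coherence for the rest. */
#include <stdatomic.h>
#include <stdio.h>

static _Atomic int counter;

int main(void)
{
    int old = atomic_fetch_add(&counter, 1);  /* lock xadd */
    printf("old=%d new=%d\n", old, counter);
    return 0;
}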
Now perhaps your claim is weaker still: that barriers may be required to drain remote store queues in an implementation of the ISA with weaker memory ordering than Haswell's.
That would still be interesting to know about. I do recall a conversation long ago with an IBM guy who worked on POWER implementations, about one of their barriers (sync, or perhaps eieio) being improved so that it no longer had to go out to the fabric, somewhere around the POWER4 to POWER5 transition. Unfortunately I've long since forgotten the details, and it may have been specific to MMIO concerns.