By: anon (anon.delete@this.anon.com), July 21, 2015 8:50 am
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on July 21, 2015 8:17 am wrote:
> anon (anon.delete@this.anon.com) on July 21, 2015 6:22 am wrote:
> > Konrad Schwarz (konrad.schwarz.delete@this.siemens.com) on July 21, 2015 12:08 am wrote:
> > > anon (anon.delete@this.anon.com) on July 20, 2015 7:29 am wrote:
> > > > Konrad Schwarz (konrad.schwarz.delete@this.siemens.com) on July 20, 2015 4:44 am wrote:
> > > > > Except that barrier operations are -- at least by default -- global: the store queues of all
> > > > > coherent CPUs are drained when a (global) barrier instruction is executed (by one CPU).
> > > >
> > > > Which CPUs and which barrier instructions might those be?
> > > >
> > >
> > > I know of Power(PC) and ARM.
> >
> > I don't believe that is the case for Power. Not sure about ARM; I don't know as much about it.
> >
> > From the Power ISA 2.07 manual, Book II 1.7:
> >
> >
> > When a processor (P1) executes a Synchronize, eieio, or mbar instruction a memory barrier is created, which
> > orders applicable storage accesses pairwise, as follows. Let A be a set of storage accesses that includes
> > all storage accesses associated with instructions preceding the barrier-creating instruction, and let B be
> > a set of storage accesses that includes all storage accesses associated with instructions following the
> > barrier-creating instruction. For each applicable pair (a_i, b_j) of storage accesses such that a_i is
> > in A and b_j is in B, the memory barrier ensures that a_i will be performed with respect to any processor
> > or mechanism, to the extent required by the associated Memory Coherence Required attributes, before b_j
> > is performed with respect to that processor or mechanism. The ordering done by a memory barrier is said
> > to be "cumulative" if it also orders storage accesses that are performed by processors and mechanisms
> > other than P1, as follows.
> >
> > - A includes all applicable storage accesses by any such processor or mechanism that
> > have been performed with respect to P1 before the memory barrier is created.
> >
> > - B includes all applicable storage accesses by any such
> > processor or mechanism that are performed after a Load
> > instruction executed by that processor or mechanism has returned the value stored by a store that is in B.
> >
> >
> > It always talks about storage accesses *with respect to* the processor that executed the barrier.
> > The extension to accesses by processors other than P1 is, I believe, specifying causality (notice
> > the first point says performed *with respect to P1* before the barrier).
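To make the causality reading concrete: here is a minimal sketch of the cumulative case, using C11 atomic fences as a stand-in for the Power barriers (the thread and variable names are mine, not the manual's).

#include <stdatomic.h>
#include <pthread.h>
#include <assert.h>

atomic_int x, y;

void *p0(void *arg) {           /* plain producer, no barrier */
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    return NULL;
}

void *p1(void *arg) {           /* observes x, then publishes y */
    while (atomic_load_explicit(&x, memory_order_relaxed) != 1)
        ;                       /* x=1 is now performed w.r.t. P1 */
    atomic_thread_fence(memory_order_release);  /* the barrier: x=1 joins set A */
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    return NULL;
}

void *p2(void *arg) {           /* sees y=1, must then see x=1 */
    while (atomic_load_explicit(&y, memory_order_relaxed) != 1)
        ;
    atomic_thread_fence(memory_order_acquire);
    /* Cumulativity guarantees this assert: P0's store entered set A of
       P1's barrier only because P1 had already observed it.  Nothing
       here requires P1's barrier to drain P0's store queue. */
    assert(atomic_load_explicit(&x, memory_order_relaxed) == 1);
    return NULL;
}

int main(void) {
    pthread_t t[3];
    pthread_create(&t[0], NULL, p0, NULL);
    pthread_create(&t[1], NULL, p1, NULL);
    pthread_create(&t[2], NULL, p2, NULL);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    return 0;
}

The only reason P0's store is ordered at all is that P1 observed it before executing the barrier; a store P1 has not yet observed is under no such obligation.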
> >
> > I can't find anything that would require an implementation to flush remote store queues
> > in response to barriers (particularly not lwsync, which orders accesses to cacheable memory),
> > but even for accesses to MMIO/caching-inhibited memory, the manual suggests you can't rely
> > on a barrier to affect remote CPUs. E.g., in Book II, 1.6, with respect to caching-inhibited storage:
> >
> >
> > None of the memory barrier instructions prevent the combining of accesses from different processors.
> >
> >
>
> I got the ARMv8-A reference. It seems to be similar to Power ISA.
>
> "[For DMB instruction] If the required shareability is Full system
> then the operation applies to all observers within the system."
>
> Now that may sound like the barrier operation should effectively be executed by all CPUs, but I don't believe
> that is the case. It's just defining the set of *observers* for which the ordering of the memory accesses is determined.
>
>
> A DMB creates two groups of memory accesses, Group A and Group B:
>
> Group A Contains:
> - All explicit memory accesses of the required access types from observers in the same
> required shareability domain as PEe that are observed by PEe before the DMB instruction.
> These accesses include any accesses of the required access types performed by PEe.
> - All loads of required access types from an observer PEx in the same required shareability domain
> as PEe that have been observed by any given different observer, PEy, in the same required shareability
> domain as PEe before PEy has performed a memory access that is a member of Group A.
>
> Group B Contains:
> - All explicit memory accesses of the required access types by
> PEe that occur in program order after the DMB instruction.
> - All explicit memory accesses of the required access types by any given observer
> PEx in the same required shareability domain as PEe that can only occur after a
> load by PEx has returned the result of a store that is a member of Group B.
>
> Any observer with the same required shareability domain as PEe observes all members of Group A before it
> observes any member of Group B to the extent that those group members are required to be observed, as determined
> by the shareability and cacheability of the memory locations accessed by the group members.
>
>
> This does the same thing as the Power ISA as far as I can see, and establishes causality ordering,
> but no further requirement. Actually it clearly shows that group A cannot contain stores
> from any CPU other than the one executing the barrier,
Sorry, this is not quite what I meant to say. Of course stores from other CPUs that have been observed by the one executing the barrier are part of that group. But being observed by another CPU means (presumably by definition) that they can be, or already have been, removed from the store queue without affecting anything further, and therefore they would have no requirement to participate in this barrier.
I mean this with respect to concurrent (i.e., not necessarily yet observed/observable) stores only.
> and group B only contains operations
> of another CPU that appear after it has loaded some data stored by an operation already in group B. If the other
> CPU is concurrently performing only stores, they will be in neither A nor B.
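To illustrate that last point: a sketch with C11 fences standing in for DMB (the function names pe_e/pe_x and the variables are hypothetical).

#include <stdatomic.h>

atomic_int a, b, z;

void pe_e(void) {               /* the PE executing the barrier */
    atomic_store_explicit(&a, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* the DMB analogue */
    atomic_store_explicit(&b, 1, memory_order_relaxed);
}

void pe_x(void) {               /* concurrently performing only stores */
    /* This store was not observed by PEe before its barrier, so it is
       not in Group A; no load here has returned the result of a Group B
       store, so it is not in Group B.  The barrier therefore says
       nothing about where z=1 lands relative to a=1 or b=1, and has no
       reason to touch PEx's store queue. */
    atomic_store_explicit(&z, 1, memory_order_relaxed);
}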
>
> Perhaps your claim is less strong: barriers *may* be required to drain store
> queues on remote CPUs depending on what that remote CPU is currently doing (so
> in any case the barrier will have to be broadcast to, and checked by, those remote CPUs).
>
> Can you provide an example of how the reordering of a remote CPU must be constrained in the case of
> a barrier? I can't see where the need would be, and we know that Intel's x86 CPUs implement
> extremely strong ordering (for their locked instructions -- global sequential ordering) in a handful
> of cycles, which would be impossible if they had to round-trip off-core, let alone off-die. That
> seems to provide a counterexample showing that such ordering can be implemented without having to
> potentially drain remote store queues. So what stronger ordering would require it?
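For reference, the x86 case I'm arguing from looks like this in C11: a sequentially consistent read-modify-write that gcc and clang typically compile down to a single LOCK-prefixed instruction.

#include <stdatomic.h>

atomic_long counter;

long bump(void) {
    /* Compiles to "lock xadd" on x86-64.  The LOCK prefix orders the
       operation against this core's own earlier and later accesses
       (draining the local store buffer), with no off-core round trip. */
    return atomic_fetch_add_explicit(&counter, 1, memory_order_seq_cst);
}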
>
> Now perhaps your claim is even less strong: that barriers may be required to drain store queues in
> an implementation of the ISA that has a weaker memory ordering than Haswell.
>
> That would still be interesting to know about. I do recall a long-ago conversation with an IBM guy who worked
> on a POWER implementation, talking about the implementation of one of their barriers (sync, or perhaps eieio)
> being improved such that it no longer had to go out to the fabric, somewhere around the POWER4 to POWER5 transition.
> Unfortunately I've long forgotten the details, and it may have been specific to MMIO concerns.
>