By: Konrad Schwarz (konrad.schwarz.delete@this.siemens.com), July 22, 2015 12:35 am
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on July 21, 2015 8:17 am wrote:
> This does the same thing as Power ISA as far as I can see, and establishes causality ordering,
> but no further requirement. Actually it clearly shows that group A can not contain stores
> of any other CPUs than the one executing the barrier, and group B only contains operations
> of another CPU that appear after it loaded some data found already in group B. If the other
> CPU is concurrently performing only stores, they will be in neither A or B.
>
> Perhaps your claim is less strong: barriers *may* be required to drain store
> queues on remote CPUs depending on what that remote CPU is currently doing (so
> in any case will have to be broadcast and checked by those remote CPUs).
Yes, this is my claim. Semantically, however, the difference is not big:
barriers are global to the set of processors that have processed the same
data as the processor issuing the memory barrier. This is the only
thing of interest to a multi-threaded program.
And, since there will be implementation limits on how well the system can track the
accesses performed by the processors, in practice many (if not all) barrier
instructions will not only be broadcast, but also acted upon globally.
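To make the claim concrete, here is a minimal sketch of the classic write-to-read causality (WRC) litmus test, written with C11 atomics rather than raw Power assembly (my own illustration, not taken from any manual; the variable names are invented). On Power a compiler implements the seq_cst accesses with sync barriers, and the cumulative guarantee is exactly that P3, having consumed P2's flag, also sees the datum that P2 itself consumed from P1. (In the Power litmus-test literature P1 needs no barrier at all for this to hold; I make every access seq_cst here only so the C11 reasoning is unambiguous.)

#include <stdatomic.h>
#include <pthread.h>
#include <assert.h>

atomic_int x; /* the datum */
atomic_int y; /* the flag */

/* P1 produces the datum. (On Power, cumulativity would make the test
   pass even with a relaxed store here; seq_cst keeps the C11
   guarantee simple.) */
static void *p1(void *arg)
{
    atomic_store(&x, 1);
    return arg;
}

/* P2 consumes the datum, then publishes a flag. Because P2 has
   "processed" P1's store, the barrier behind P2's accesses must
   cover that store for all other processors. */
static void *p2(void *arg)
{
    while (!atomic_load(&x))
        ; /* spin until the datum is visible */
    atomic_store(&y, 1);
    return arg;
}

/* P3 consumes the flag; cumulativity guarantees it then also sees
   the datum, even though it never communicated with P1 directly. */
static void *p3(void *arg)
{
    while (!atomic_load(&y))
        ; /* spin until the flag is visible */
    assert(atomic_load(&x) == 1); /* must hold */
    return arg;
}

int main(void)
{
    pthread_t t1, t2, t3;
    pthread_create(&t1, 0, p1, 0);
    pthread_create(&t2, 0, p2, 0);
    pthread_create(&t3, 0, p3, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    pthread_join(t3, 0);
    return 0;
}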
> Can you provide an example of how reordering of remote CPU must be constrained in the case of
> a barrier?
Figure 5-2 in the "Programming Environments Manual for 32-Bit Implementations of the PowerPC Architecture, Rev. 3" provides such an example.
This figure shows the instructions issued by a processor P1 (inst1 to inst11, with inst5 being
a sync, a cumulative memory barrier) and the instructions issued by a processor P2
(instL, instM, up to instX).
G1, the group of instructions ordered ahead of the barrier, comprises inst1 to inst4,
plus all instructions of P2 that affect P1 (i.e., because P1 loads a datum written by P2),
here instL to instO, assuming instO happens before the sync instruction inst5.
G2, the group of instructions ordered after the barrier, comprises inst6 to inst11, plus
instP to instX, assuming instP happens after inst5 and that P2 loads a datum written by
an instruction already in G2. (So G1 and G2 are defined as the transitive closure of these
data dependencies.)
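Mapped onto code, the grouping might look like the following hypothetical two-thread reconstruction (the variable names are mine, and C11 seq_cst atomics again stand in for sync):

#include <stdatomic.h>
#include <pthread.h>
#include <assert.h>

atomic_int x; /* written by P1 ahead of its sync: in G1 */
atomic_int a; /* written by P2, loaded by P1 before the sync */
atomic_int b; /* written by P1 after its sync: in G2 */

static void *P1(void *arg)
{
    atomic_store(&x, 1);      /* ~inst1..inst4: ahead of the barrier */
    while (!atomic_load(&a))
        ;                     /* P1 loads P2's datum, pulling P2's
                                 store of a into G1 */
    /* ~inst5: the sync sits here (implied by the seq_cst accesses) */
    atomic_store(&b, 1);      /* ~inst6..inst11: in G2 */
    return arg;
}

static void *P2(void *arg)
{
    atomic_store(&a, 1);      /* ~instL..instO: joins G1 once P1 loads a */
    while (!atomic_load(&b))
        ;                     /* ~instP: loads a datum written in G2,
                                 so P2's later accesses join G2 */
    assert(atomic_load(&x) == 1); /* everything in G1 -- including the
                                     store of x -- is performed before
                                     anything in G2 */
    return arg;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, 0, P1, 0);
    pthread_create(&t2, 0, P2, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return 0;
}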
> I can't see where the need would be,
I think this definition reflects the most relaxed notion of coherency that still allows
a causal ordering, and I suspect it is mostly a theoretical ideal -- one that allows proving
the correctness of various shared-memory "lock free" algorithms.
Practical implementations will be much simpler, e.g.,
simply causing all processors to flush their store queues.