By: anon (anon.delete@this.anon.com), July 23, 2015 3:17 am
Room: Moderated Discussions
Konrad Schwarz (konrad.schwarz.delete@this.siemens.com) on July 22, 2015 12:35 am wrote:
> anon (anon.delete@this.anon.com) on July 21, 2015 8:17 am wrote:
>
> > This does the same thing as Power ISA as far as I can see, and establishes causality ordering,
> > but no further requirement. Actually it clearly shows that group A can not contain stores
> > of any other CPUs than the one executing the barrier, and group B only contains operations
> > of another CPU that appear after it loaded some data found already in group B. If the other
> > CPU is concurrently performing only stores, they will be in neither A nor B.
> >
> > Perhaps your claim is less strong: barriers *may* be required to drain store
> > queues on remote CPUs depending on what that remote CPU is currently doing (so
> > in any case will have to be broadcast and checked by those remote CPUs).
>
> Yes, this is my claim. Semantically, however, there is not a big difference:
> barriers are global to the set of processors that have processed the
> same data that the processor issuing the memory barrier has. This is the only
> thing of interest to a multi-threaded program.
>
> And, since there will be implementation limits on how well the system can track the
> accesses performed by the processors, in practice many (if not all) barrier
> instructions will not only be broadcast, but also acted upon globally.
Well, I don't think that is actually the case as a general statement about what the ISA requires. The unfenced memory ordering is strictly weaker than x86's (giving more leeway in implementation), and the barriers are no stronger than the barrier implied by x86's LOCK prefix. x86 can execute those instructions in, what, 20 or 30 cycles, so it clearly isn't going off chip.
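To be concrete about why the barrier only needs local action in the common case: the store-buffer (Dekker) pattern is what a full barrier exists for. Here's a minimal sketch with C11 atomics (my own illustration, nothing taken from either ISA document; the variable names and the pthread scaffolding are just for the example):

/* Store-buffer (Dekker) litmus test, sketched with C11 atomics. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int x, y;
int r0, r1;

void *cpu0(void *arg) {
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    /* full barrier: MFENCE or a locked op on x86, (hw)sync on Power */
    atomic_thread_fence(memory_order_seq_cst);
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

void *cpu1(void *arg) {
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, cpu0, NULL);
    pthread_create(&t1, NULL, cpu1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    /* With the fences, r0 == 0 && r1 == 0 is forbidden on both x86 and
     * weaker ISAs; without them, both allow it (store queues not yet
     * drained when the loads execute). */
    printf("r0=%d r1=%d\n", r0, r1);
    return 0;
}

All each fence has to do here is make the issuing core's own store globally visible before its subsequent load; no remote CPU needs to be touched, which fits the 20-30 cycle figure.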
I wouldn't be surprised if some implementations do "dumb" things like broadcasting the barrier; I've heard of some CPUs taking thousands of cycles to execute one. But that does not seem like an ISA deficiency.
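And to be concrete about the causality-ordering point from the quoted paragraphs: the relevant litmus test is write-to-read causality (WRC), where a barrier has to order not just the issuing CPU's own stores but also a store from another CPU that it has already observed. Again just a sketch with C11 atomics; the thread names and the mapping onto any particular ISA's barrier instruction are my assumptions, not something taken from the documents under discussion:

/* Write-to-read causality (WRC): the causality-ordering case. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int x, y;
int r_y = -1, r_x = -1;

void *writer(void *arg) {            /* CPU 0: plain store, no barrier */
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    return NULL;
}

void *forwarder(void *arg) {         /* CPU 1: sees x, then publishes y */
    while (atomic_load_explicit(&x, memory_order_relaxed) == 0)
        ;
    /* This barrier must also cover CPU 0's store to x, which this CPU
     * has already observed; that is the causality requirement.
     * (On Power an lwsync suffices for this pattern.) */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    return NULL;
}

void *reader(void *arg) {            /* CPU 2: if it sees y, it must see x */
    while ((r_y = atomic_load_explicit(&y, memory_order_relaxed)) == 0)
        ;
    atomic_thread_fence(memory_order_acquire);
    r_x = atomic_load_explicit(&x, memory_order_relaxed);  /* must be 1 */
    return NULL;
}

int main(void) {
    pthread_t t[3];
    pthread_create(&t[0], NULL, writer, NULL);
    pthread_create(&t[1], NULL, forwarder, NULL);
    pthread_create(&t[2], NULL, reader, NULL);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    printf("r_y=%d r_x=%d\n", r_y, r_x);  /* r_y=1 with r_x=0 is forbidden */
    return 0;
}

The point is that the forwarder's barrier drags CPU 0's store along with it. That is the property the ISA has to guarantee, and an implementation can in principle provide it through the coherence protocol without broadcasting every barrier, which is why I don't see an ISA deficiency here.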