By: Michael S (already5chosen.delete@this.yahoo.com), July 23, 2015 7:31 am
Room: Moderated Discussions
Konrad Schwarz (konrad.schwarz.delete@this.siemens.com) on July 23, 2015 4:59 am wrote:
> anon (anon.delete@this.anon.com) on July 23, 2015 3:17 am wrote:
> > Well, I don't think that is actually the case as a general statement as defined by the ISA.
>
> How else would one explain the difference between "cumulative" vs. non-cumulative barriers?
>
> > The unfenced memory ordering is strictly weaker than x86's (giving more leeway in implementation),
> > and the barriers are no stronger than x86's lock-prefix barriers. Since x86 can execute those
> > instructions in, what, 20 or 30 cycles, it must not be going off chip.
>
> ARMv7 barriers can encode different "shareability domains": full
> system, outer shareable, inner shareable, and non-shareable.
>
> One example given in the reference manual for shareability domains is a system consisting of two
> "clusters" of processors, where inner shareable applies only to the processors within a cluster,
> and outer shareable applies across both clusters. Clearly, there is some benefit from limiting
> the shareability domain, and that benefit must come from reduced communication costs.
>
> So your 20--30 cycle number may only be correct for a single-socket
> machine, but not for a Xeon-EP, let alone an SGI Origin.
>
I don't think you are correct about that.
Xeon-EP is already "almost-SC" even without interlocked barriers, and SGI Origin (MIPS 1xK) is already SC without any memory barriers.
That appears to imply that on these systems all potentially globally relevant coherence-state transitions already have a total order.
So, for Origin it seems super-obvious that a barrier does not have to be broadcast.
For Xeon-EP it's less obvious, but I think it's true as well.
According to my understanding, on Xeon-EP all interlocked operations are totally ordered relative to any other interlocked operations, related or not. They are also totally ordered relative to any loads or stores on the same processor, and relative to all accesses to locked memory locations on all processors in the coherence domain. However, they don't have to establish a total order relative to completely unrelated memory operations on unrelated processors; i.e., on those unrelated processors, load promotion and forwarding of not-yet-globally-visible stores are still allowed and don't have to be disturbed. So, as long as the processor that executes the interlocked op has the cache line in question in Modified or Exclusive state, I don't see why it would need to do anything non-local.
> > I am not surprised if some implementations do "dumb" things like doing a broadcast. I've heard of some
> > CPUs taking thousands of cycles to execute a barrier. But it does not seem like an ISA deficiency.
>
> No, I think it is an unavoidable cost of shared-memory multiprocessing. This is why an
> MPI cluster is a viable architecture for large machines.
>