By: Konrad Schwarz (konrad.schwarz.delete@this.siemens.com), July 23, 2015 4:59 am
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on July 23, 2015 3:17 am wrote:
> Well, I don't think that is actually the case as a general statement as defined by the ISA.
How else would one explain the difference between "cumulative" vs. non-cumulative barriers?
> The unfenced memory ordering is strictly weaker than x86 (giving more leeway in implementation),
> and the barriers are no stronger than x86's lock prefix barriers. Since x86 can execute those
> instructions in, what, 20 or 30 cycles so it must not be going off chip.
ARMv7 barriers can encode different "shareability domains": full system, outer shareable, inner shareable, and non-shareable.
One example given in the reference manual for shareability domains is a system consisting of two "clusters" of processors, where inner shareable applies only to the processors within a cluster,
and outer shareable applies across both clusters. Clearly, there is some benefit to limiting the shareability domain, and that benefit must be reduced communication cost.
So your 20--30 cycle number may only be correct for a single-socket machine, but not for a Xeon-EP, let alone an SGI Origin.
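To make the domain encoding concrete, here is a minimal sketch in C of how the shareability domain appears in the instruction itself. It assumes the A32 encoding of DMB (0xF57FF050 with a 4-bit option field in the low bits) and covers only the variants that order both reads and writes. The names `domain`, `barrier_option`, and `dmb_encoding` are made up for illustration and are not part of any ARM header.

```c
#include <assert.h>
#include <stdint.h>

/* Shareability domains that an ARMv7 DMB/DSB can name (illustrative enum). */
enum domain { NON_SHAREABLE, INNER_SHAREABLE, OUTER_SHAREABLE, FULL_SYSTEM };

/* 4-bit option field for a barrier that orders both reads and writes.
   Assembler mnemonics: NSH, ISH, OSH, SY. */
static uint32_t barrier_option(enum domain d)
{
    switch (d) {
    case NON_SHAREABLE:   return 0x7; /* NSH */
    case INNER_SHAREABLE: return 0xB; /* ISH */
    case OUTER_SHAREABLE: return 0x3; /* OSH */
    default:
        assert(d == FULL_SYSTEM);     /* reject values outside the enum */
        return 0xF;                   /* SY  */
    }
}

/* Assumed A32 encoding of "DMB <option>": option occupies bits [3:0]. */
static uint32_t dmb_encoding(enum domain d)
{
    return 0xF57FF050u | barrier_option(d);
}
```

Under these assumptions, `dmb_encoding(INNER_SHAREABLE)` corresponds to `dmb ish` and `dmb_encoding(FULL_SYSTEM)` to `dmb sy`. The encoding says nothing about cost; it only tells the implementation how far the barrier's effect has to be made visible.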
> I am not surprised if some implementations do "dumb" things like doing a broadcast. I've heard of some
> CPUs taking thousands of cycles to execute a barrier. But it does not seem like an ISA deficiency.
No, I think it is an unavoidable cost of shared-memory multiprocessing. This is why an
MPI cluster is a viable architecture for large machines.