By: Wilco (Wilco.Dijkstra.delete@this.ntlworld.com), August 31, 2014 12:52 pm
Room: Moderated Discussions
dmcq (dmcq.delete@this.fano.co.uk) on August 31, 2014 4:54 am wrote:
> foobar (a.delete@this.b.c) on August 30, 2014 9:26 pm wrote:
> > anon (anon.delete@this.anon.com) on August 29, 2014 3:49 pm wrote:
> > > Aaron Spink (aaronspink.delete@this.notearthlink.net) on August 29, 2014 7:35 am wrote:
> > > > Howard Chu (hyc.delete@this.symas.com) on August 28, 2014 9:17 pm wrote:
> > > > > Pretty good comparison of x86, PPC, and ARM here http://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/
> > > > >
> > > >
> > > > That's a pretty excellent example of what I was talking about.
> > >
> > > While I don't disagree that you could always find these kinds of cases --
> > > exactly because the different ISA constraints allow optimization effort to
> > > be allocated differently -- I'm not *entirely* happy with the numbers.
> > >
> > > I mean, they're interesting for what they are, but firstly, that powerpc core is an old core.
> > > It's a pentium4-era core. Now the pentium4 cores would probably do reasonably well on this test
> > > too, but nobody would say they don't have horrible performance cases as well, despite being strongly
> > > ordered. Try an atomic operation and it would probably take hundreds of cycles.
> > >
> > > Secondly, compare; branch; isync is not the nicest way to implement read-read ordering on powerpc. isync
> > > is not so much a memory barrier as a hammer.
> >
> > For the PowerPC 4xx processors, isync flushes the shadow TLBs
> > as well. I bet it does the same thing for the ERATs in Power8.
>
> The figures in that paper are not nearly as convincing once you adjust for the speed of the ARM
> A9 at 850MHz and the Core i7 at 2.3GHz, as given by Geekbench. The Intel machine is about 10 times as fast.
> Thus, instead of the enormous difference of 0.81ns compared to 16.89ns, you're talking about the ARM version
> being about twice as slow in relative terms. A factor of two either way could be explained by practically anything.
The i7 runs at 3.3GHz turbo, so it is no surprise that it can execute a loop dominated by memory accesses almost 15 times faster than an A9 at just 850MHz. If the comparison had used an A9 at full clock speed, or a more modern ARM core, it wouldn't look as skewed.
The second, optimized example shows how fast the loop runs without any barriers, which gives the baseline performance of each CPU. A better way to express the results is to convert them to cycles:
CPU     acq    fast   synchronization overhead (cycles)
i7      2.6     2.4    0
970    40.7     7.7   33
A9     14.4     9.1    5
The overhead on the PPC970 is huge, but then again it's a 2002 core. The overhead of dmb on the Cortex-A9 is small in comparison. It's a shame he didn't test an iPhone 5s, as AArch64 has dedicated load-acquire and store-release instructions (ldar/stlr) which have a fraction of the overhead of dmb. Note that he was wrong about compilers not supporting consume semantics; GCC certainly does.
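To make the acquire/release point concrete, here is a minimal C++11 sketch of the publish/read pattern being measured. It is not the code from the linked article; `Payload`, `g_guard`, `publish`, `read_value` and `run_once` are hypothetical names. On AArch64, compilers typically turn the acquire load and release store into ldar and stlr, while on ARMv7 they typically have to fall back to a plain load or store plus dmb, though the exact code depends on the compiler and target.

```cpp
#include <atomic>
#include <thread>

struct Payload { int value; };

// Hypothetical guard pointer shared between the two threads.
std::atomic<Payload*> g_guard{nullptr};

// Producer: fill in the payload, then publish it with a store-release.
void publish(Payload* p, int v) {
    p->value = v;
    g_guard.store(p, std::memory_order_release);
}

// Consumer: spin until the pointer is published, then read through it.
// The acquire load guarantees the payload write is visible. Since the read
// depends on the loaded pointer, memory_order_consume would also be enough
// here; implementations are allowed to treat consume as acquire.
int read_value() {
    Payload* p;
    while ((p = g_guard.load(std::memory_order_acquire)) == nullptr) {
        std::this_thread::yield();
    }
    return p->value;
}

// Run one producer and one consumer; returns what the consumer observed.
int run_once(int v) {
    Payload payload{0};
    g_guard.store(nullptr, std::memory_order_relaxed);
    int seen = -1;
    std::thread consumer([&] { seen = read_value(); });
    std::thread producer([&] { publish(&payload, v); });
    producer.join();
    consumer.join();
    return seen;
}
```

Calling `run_once(42)` should always return 42, because the release store and the acquire load order the payload write before the read.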
Wilco