By: anon (anon.delete@this.anon.com), August 29, 2014 3:49 pm
Room: Moderated Discussions
Aaron Spink (aaronspink.delete@this.notearthlink.net) on August 29, 2014 7:35 am wrote:
> Howard Chu (hyc.delete@this.symas.com) on August 28, 2014 9:17 pm wrote:
> > Pretty good comparison of x86, PPC, and ARM here http://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/
> >
>
> That's a pretty excellent example of what I was talking about.
While I don't disagree that you could always find these kinds of cases -- exactly because the different ISA constraints allow optimization effort to be allocated differently -- I'm not *entirely* happy with the numbers.
I mean, they're interesting for what they are, but firstly, that powerpc core is an old core. It's a pentium4-era core. Now, the pentium4 cores would probably do reasonably well on this test too, but nobody would say they don't have horrible performance cases of their own, despite being strongly ordered. Try an atomic operation on one and it would probably take hundreds of cycles.
Secondly, compare; branch; isync is not the nicest way to implement read-read ordering on powerpc. isync is not so much a memory barrier as a hammer: it basically means stop speculative execution until the isync instruction itself executes. They put it after the branch which depends on the result of the first load, which effectively means no other loads (or anything else) will execute until then.
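For concreteness, here's a minimal C11 sketch of that kind of acquire load; the asm in the comment is my own reconstruction of the cmp;bne;isync shape, not the article's exact code, and the function name is mine (compilers also differ on which lowering they actually pick):

#include <stdatomic.h>

/* Acquire load of a published pointer. The cmp;bne;isync style lowering
 * on powerpc looks roughly like:
 *   ld    r9,0(r3)   # the load
 *   cmpw  cr7,r9,r9  # compare the loaded value with itself
 *   bne-  cr7,$+4    # never-taken branch -- creates a control dependency
 *   isync            # discard speculation until the branch resolves
 * so nothing after the isync can execute until the load's value is known. */
void *load_acquire(_Atomic(void *) *slot)
{
    return atomic_load_explicit(slot, memory_order_acquire);
}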
The 64-bit powerpc ISA has an instruction called lwsync, which really should be used instead of cmp;bne;isync. IBM's own code has used the isync-style barrier for this in the past, but I suspect that's something of a holdover from before lwsync was added. I note that version 2.07 of the ISA suggests lwsync for an acquire barrier when dealing with cacheable memory.
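Expressed the same way, the lwsync variant is a relaxed load followed by an acquire fence, which on ppc64 typically compiles to a single lwsync (again my own sketch, with my own naming):

#include <stdatomic.h>

/* Same pattern with lwsync as the barrier, per the ISA 2.07 suggestion:
 *   ld     r9,0(r3)  # the load
 *   lwsync           # orders this load before all later loads and stores
 * lwsync only constrains memory accesses; unlike isync it doesn't throw
 * away speculative work, so independent instructions can keep executing. */
void *load_acquire_lwsync(_Atomic(void *) *slot)
{
    void *p = atomic_load_explicit(slot, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);  /* -> lwsync on ppc64 */
    return p;
}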
Now take that loop and run it with lwsync on a POWER8, and I would not be surprised if it is still significantly slower than on the i7. So while I don't think the numbers here are good evidence, I agree with your premise that you would expect barriers to be more expensive on hardware implementing weak memory ordering. I just disagree that that's necessarily a bad thing, and IMO it's impossible to draw that conclusion from a microbenchmark alone.
>
> It's why, although a weakly ordered model is naively very attractive from a hardware perspective, I would
> still choose to make any new architecture strongly ordered and make the strongly ordered cases fast.
> Weak ordering is deceptive: to make it robust, you have to handle all the hard cases, which means
> you basically have to have all the hardware that you would have in a strongly ordered MOM. But because,
> for the majority of test/performance cases you'll look at, it doesn't matter, it will naturally be a low
> priority and you won't actually run into the real issues until you hit real hardware.
>
> AKA, a strongly ordered MOM is better because it's worse.
> Just another one of those lovely hardware paradoxes.
>
> This also carries over into atomics. LL/SC from a hardware perspective is easy/simple. But overall it's worse
> than doing the full set of real atomics. In theory LL/SC is better and can do everything, but because it's better/easier,
> it generally ends up worse. And as we get into the future of a large number of cores per chip and start getting
> software that is actually coherent across those cores, the advantages of the hard atomics and the capability to leverage
> them in the fabric will likely win out. There is no reason why a FETCH_ADD has to be done at the core. In
> fact, in many cases it makes sense to do the FETCH_ADD where the line is located. AKA, ship the atomic, not
> the line. And this doesn't just apply at the coherence level; it makes just as much sense at the message passing
> level. And in both cases, it should be a performance and power win.
>
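To put some code next to the FETCH_ADD point: in C11 it's the same one call either way, and the question being argued is whether the hardware executes it as a single far atomic shipped to wherever the line lives, or has to pull the line in and loop over LL/SC at the core. The sketch below is mine, not from the post, and the function name is made up:

#include <stdatomic.h>

/* With a "real" atomic this can be a single instruction the fabric could
 * execute remotely at the line's home (e.g. x86 lock xadd, ARMv8.1 ldadd).
 * With LL/SC the core has to own the line and retry on failure:
 *   again: ldarx  r9,0,r3   # load-linked
 *          add    r9,r9,r4
 *          stdcx. r9,0,r3   # store-conditional
 *          bne-   again     # lost the reservation, try again */
unsigned long counter_add(_Atomic unsigned long *ctr, unsigned long n)
{
    return atomic_fetch_add_explicit(ctr, n, memory_order_relaxed);
}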