By: Aaron Spink (aaronspink.delete@this.notearthlink.net), August 29, 2014 7:35 am
Room: Moderated Discussions
Howard Chu (hyc.delete@this.symas.com) on August 28, 2014 9:17 pm wrote:
> Pretty good comparison of x86, PPC, and ARM here http://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/
>
That's a pretty excellent example of what I was talking about.
It's why, although a weakly ordered model is naively very attractive from a hardware perspective, I would still choose to make any new architecture strongly ordered and make the strongly ordered cases fast. Weak ordering is deceptive: to make it robust, you have to handle all the hard cases, which means you basically need all the hardware that you would have in a strongly ordered MOM. But because it doesn't matter for the majority of test/performance cases you'll look at, it will naturally be a low priority, and you won't actually run into the real issues until you hit real hardware.
AKA, a strongly ordered MOM is better because it's worse. Just another one of those lovely hardware paradoxes.
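To make the "hard cases" concrete, here is a minimal C++11 sketch of the classic message-passing pattern. The names (`payload`, `ready`, `producer`, `consumer`, `run_message_passing`) are made up for illustration and do not come from the linked article. The release/acquire pair is what makes the consumer's read of `payload` well defined. On x86 these orderings typically compile to plain loads and stores, while on PPC and ARM the compiler typically has to emit barriers or special load/store instructions. That is the trap: code that gets the ordering wrong tends to pass testing on a strongly ordered machine and only misbehaves on weakly ordered hardware.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, for illustration only.
int payload = 0;                   // plain, non-atomic data
std::atomic<bool> ready{false};    // flag that publishes the data

void producer() {
    payload = 42;                                   // write the data first
    ready.store(true, std::memory_order_release);   // then publish it
}

int consumer() {
    // Spin until the flag is set. The acquire load pairs with the
    // release store above, so the write to payload is visible here.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;
}

// Runs one producer and one consumer and returns what the consumer saw.
int run_message_passing() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;
    std::thread c([&] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();                      // join makes the write to seen visible
    assert(ready.load());          // the flag must have been published
    return seen;
}
```

Weakening both orderings to `memory_order_relaxed` would make this a data race in the C++ model, even though it would likely still appear to work in most tests on strongly ordered hardware.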
This also carries over into atomics. LL/SC is easy/simple from a hardware perspective, but overall it's worse than doing the full set of real atomics. In theory LL/SC is better and can do everything, but because it's better/easier, it generally ends up worse. And as we get into the future of a large number of cores per chip and start getting software that is actually coherent on those cores, the advantages of the hard atomics and the capability to leverage them to the fabric will likely win out. There is no reason why a FETCH_ADD has to be done at the core. In fact, in many cases it makes sense to do the FETCH_ADD where the line is located. AKA, ship the atomic, not the line. And this doesn't just apply at the coherence level; it makes just as much sense at the message passing level. In both cases, it should be a performance and power win.
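A rough C++ sketch of the difference, with hypothetical helper names (`add_with_fetch_add`, `add_with_cas_loop`, `hammer`). The first form states the whole operation in one call, so an implementation is free to execute it wherever it likes. The second is a read, modify, retry loop, used here as a portable stand-in for the shape of an LL/SC sequence (it is not LL/SC itself): the core has to read the value, compute locally, and try again if someone else got there first. C++ cannot express where the operation runs; that part is purely a hardware and fabric matter.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helpers, for illustration only.

// One self-contained operation: hardware with a real fetch-add can
// perform it in a single step, in principle wherever the line lives.
inline int add_with_fetch_add(std::atomic<int>& counter, int delta) {
    return counter.fetch_add(delta, std::memory_order_relaxed);
}

// Retry loop with the same result. The value must come to the core,
// and the update is retried whenever another thread changed it first.
inline int add_with_cas_loop(std::atomic<int>& counter, int delta) {
    int observed = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(observed, observed + delta,
                                          std::memory_order_relaxed)) {
        // On failure, observed now holds the current value; try again.
    }
    return observed;
}

// Runs `threads` threads that each add 1 `iterations` times using the
// supplied add function, and returns the final counter value.
template <typename AddFn>
int hammer(AddFn add, int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                add(counter, 1);
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return counter.load();
}
```

Both versions produce the same final count. The difference is in how much traffic and retrying it takes to get there under contention, which this sketch does not measure.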