By: anon (anon.delete@this.anon.com), August 16, 2014 11:10 pm
Room: Moderated Discussions
Aaron Spink (aaronspink.delete@this.notearthlink.net) on August 16, 2014 10:25 pm wrote:
> Maynard Handley (name99.delete@this.name99.org) on August 16, 2014 2:24 pm wrote:
> > Likewise for synchronization primitives. The consensus as I read the literature is
> > that load locked/store conditional is substantially easier to implement and get right
> > than LOCK prefixes and the random mix of other things that Intel has. I'm guessing
> > it's then also easier to build HW TM on top of load locked/store conditional.
> > Beyond this, I'm guessing it's substantially harder to design for and verify
> > the Intel memory model than the looser POWER and ARM memory models.
> >
> Actually, many software people have come to mostly loathe LL/SC, and it's been
> involved in lots of bugs, of either software or hardware, over the years. The
> truth is that LL/SC is pretty much only used as, effectively, CMPXCHG.
>
> And it's generally easier to design and verify for the x86 MOM than for any of the more relaxed
> memory models, esp. from a software perspective. And from a performance perspective, because of
> the software issues, the stricter MOMs tend to have better performance.
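On the LL/SC-as-CMPXCHG point: the usage pattern in question is the standard compare-exchange retry loop, which C11 exposes portably. The compiler emits an LL/SC loop (e.g. lwarx/stwcx. on POWER, LDXR/STXR on AArch64) or LOCK CMPXCHG on x86 from the same source. A minimal sketch (function name mine):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* The canonical read-modify-write loop. On POWER/ARM this compiles to an
   LL/SC retry loop; on x86 to LOCK CMPXCHG in a loop. Either way the
   source-level idiom is the same compare-exchange pattern. */
int atomic_add_via_cas(atomic_int *p, int delta) {
    int old = atomic_load_explicit(p, memory_order_relaxed);
    /* On failure, 'old' is reloaded with the current value; the weak form
       may also fail spuriously (an SC that lost its reservation), so loop. */
    while (!atomic_compare_exchange_weak(p, &old, old + delta))
        ;
    return old + delta;
}
```

Note that the "weak" variant exists precisely because of LL/SC: a store-conditional can fail even without contention, and the loop absorbs that.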
I'd like to see actual evidence of this. Of course we've all seen graphs showing the opposite -- that release consistency (weak ordering with release barriers) is more performant than strong and processor ordering.
On the other hand, those have tended to be in academic papers, and the most recent one I've seen is quite old. So I don't put a lot of stock in them, for modern CPUs and real applications.
That said, I've never seen any evidence of the opposite (except anecdotal, and not very convincing). Keep in mind that implementations can always choose to implement stronger ordering than specified (I think I've read that many ARM implementations do this, for example). Possibly that is done for compatibility and/or ease of hardware implementation. But if it were a performance advantage, you would expect to see IBM going that way. Because they actually *do* support the option for x86-like ordering in recent POWER CPUs, but they keep their weak ordering around.
It would actually be an interesting study to run native PowerPC code under strong ordering mode vs weak, and examine the performance difference. Obviously it would just be a single data point, not evidence of the general case, because different levels of optimization probably went into each (Intel probably spent far more effort optimizing processor ordering than IBM did!). But still it would be interesting, IMO.
> AKA with a weaker MOM,
> software developers tend to be much more cautious which leads to lower performance.
I'm not sure I agree. It may be true that they tend to introduce more bugs, but if they don't know what they are doing, they really should just use synchronization libraries and language features, and those should generally get the ordering details correct.
I'm not saying that this always happens, but I've never heard of anybody writing portable code, getting worried about a weakly ordered CPU, and then declining to attempt some optimization that they otherwise would have. People who don't understand this probably also don't understand that x86 can reorder loads ahead of stores too.
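That load-ahead-of-store reordering is the classic store-buffering litmus test. A sketch in portable C11 (pthread harness, names mine): with relaxed or plain accesses, both threads can observe 0 even on x86, because each load may pass the preceding store to a different location; with seq_cst operations, that outcome is forbidden on every architecture.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Store-buffering litmus test. Each thread stores to its own flag, then
   loads the other's. With seq_cst (the default), r0 == r1 == 0 is forbidden:
   whichever store is first in the seq_cst total order is seen by the other
   thread's load. Demote these to memory_order_relaxed and both-zero becomes
   a legal outcome -- even on x86. */
static atomic_int x, y;
static int r0, r1;

static void *t0(void *arg) {
    (void)arg;
    atomic_store(&x, 1);      /* seq_cst store */
    r0 = atomic_load(&y);     /* seq_cst load  */
    return NULL;
}

static void *t1(void *arg) {
    (void)arg;
    atomic_store(&y, 1);
    r1 = atomic_load(&x);
    return NULL;
}

/* Returns 1 if the forbidden outcome (both threads saw 0) is ever observed. */
int sb_forbidden_outcome_seen(int iters) {
    for (int i = 0; i < iters; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_t a, b;
        pthread_create(&a, NULL, t0, NULL);
        pthread_create(&b, NULL, t1, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r0 == 0 && r1 == 0)
            return 1;
    }
    return 0;
}
```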
> So far, the
> performance/strictness relationship with MOMs has been a case of Theory != Practice.
I'm really not sure that's the case, because it's very hard to know the details. We may not know whether Intel spends 1% additional power on a larger store queue because in-order retirement of stores effectively increases the amount of time stores have to remain buffered. We don't know if speculative reordering of loads costs them a small additional amount of complexity or power due to failed speculation, etc.
We (for most values of 'we' that comment on this topic in this forum) see the software side of it. Which *appears* like memory barriers are slow and annoying to use.
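For concreteness, the release-consistency idiom those academic comparisons measure is message passing with release/acquire. In portable C11 it looks like the sketch below (names mine): on x86 the strong MOM makes both the release store and the acquire load compile to plain MOVs, while POWER/ARM compilers emit lwsync/dmb-style barriers, which is exactly where the "barriers are slow and annoying" impression comes from.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Message passing with release/acquire. The release store "publishes" the
   plain write to 'payload'; an acquire load that observes the flag is
   guaranteed to also observe the payload, on any conforming implementation. */
static int payload;
static atomic_int ready;

static void *producer(void *arg) {
    (void)arg;
    payload = 42;                                            /* plain store */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* publish     */
    return NULL;
}

/* Spins until the flag is set, then returns the payload (always 42). */
int consume(void) {
    pthread_t p;
    payload = 0;
    atomic_store_explicit(&ready, 0, memory_order_relaxed);
    pthread_create(&p, NULL, producer, NULL);
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                                    /* spin-wait   */
    int v = payload;  /* acquire pairs with release: never reads 0 here */
    pthread_join(p, NULL);
    return v;
}
```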
>
> Maynard, you've been around long enough that you've almost certainly seen Linus
> rant #X on this topic. This is definitely one of those areas where I agree with
> Linus. Make it easier for the programmers, hardware designers be damned.
Often I would agree. However, there is clearly a grey area. Intel allows reordering of loads ahead of stores for hardware performance, and was very reluctant to codify the rest of their ordering rules for a long time, presumably until they decided that they could live with (or work around) the costs of providing stronger ordering on load/load and store/store. If it were such a walk in the park, and very beneficial for programmers, they should have done that years before they did.
And for the case of low-level synchronization details, programmers be damned (to some extent). The average app programmer will never even grasp x86 ordering rules anyway, so it's a lost cause. Have them use language/library features, and everyone is happy. The language and library programmers who are incapable of understanding the POWER rules probably aren't the correct people to be writing the x86 code either.
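In practice, "use the language features" means something like this sketch (names mine): atomic_fetch_add is correct under every memory model, compiles to LOCK XADD on x86 and an LL/SC loop on POWER/ARM, and the app programmer never touches a barrier or an ordering rule.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* A shared counter incremented from several threads, using only portable
   C11 atomics. No fences, no architecture-specific reasoning required. */
static atomic_long counter;

static void *worker(void *arg) {
    long n = (long)(size_t)arg;
    for (long i = 0; i < n; i++)
        atomic_fetch_add(&counter, 1);   /* LOCK XADD / LL-SC loop */
    return NULL;
}

/* Runs nthreads workers (capped at 16), each adding per_thread increments,
   and returns the final count. */
long count_with_threads(int nthreads, long per_thread) {
    pthread_t t[16];
    if (nthreads > 16)
        nthreads = 16;
    atomic_store(&counter, 0);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, worker, (void *)(size_t)per_thread);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&counter);
}
```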