By: Aaron Spink (aaronspink.delete@this.notearthlink.net), August 17, 2014 12:42 am
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on August 17, 2014 12:10 am wrote:
> I'd like to see actual evidence of this. Of course we've all seen graphs showing the opposite -- that release
> consistency (weak ordering with release barriers) is more performant than strong and processor ordering.
>
Most of the evidence is anecdotal in nature. Some of this is because processors that are strongly ordered tend to spend some extra hardware resources to increase performance (most x86), and some because programmers tend to be extra cautious on weakly ordered architectures, having been burned too many times.
> On the other hand, those have tended to be on academic papers, and the most recent one I've seen
> is quite old. So I don't put a lot of stock in those, for modern CPUs and real applications.
>
As I said, this tends to be one of those areas where theory and practice don't exactly mesh. Academic papers tend to be more theory.
> That said, I've never seen any evidence (except anecdotal and not very convincing)
> of the opposite. Keep in mind that implementations can always choose to implement stronger ordering
> than specified (I think I've read that many ARM implementations do this, for example). Possibly
> that is done for compatibility and/or ease of hardware implementation. But if it were a performance
> advantage, you would expect to see IBM going that way. Because they actually *do* support the option
> for x86-like ordering in recent POWER CPUs, but they keep their weak ordering around.
>
Even if they implement stronger ordering than specified, it is just laying a minefield for the future. One example is the EV5 to EV6 transition, where EV5 was much more strict than the actual Alpha architecture required, which resulted in numerous hard-to-track-down bugs when EV6 was released. EV6 was weakly ordered as the architecture allowed, but the software wasn't written for it.
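To make the failure mode concrete, here is a minimal sketch (my illustration in C11 atomics, not actual Alpha-era code) of the classic publish pattern that happens to work on an implementation stricter than the architecture, then breaks the day the hardware exercises the reordering it was always allowed to do:

    #include <stdatomic.h>

    int data;          /* payload, written before the flag */
    atomic_int ready;  /* flag the consumer spins on */

    void producer(void) {
        data = 42;
        /* No ordering requested: the store to ready may become visible
         * before the store to data on a weakly ordered CPU. On a
         * too-strict implementation (EV5-style) this never fails. */
        atomic_store_explicit(&ready, 1, memory_order_relaxed);
    }

    int consumer(void) {
        while (!atomic_load_explicit(&ready, memory_order_relaxed))
            ;          /* spin until the flag is set */
        return data;   /* may read stale data once stores reorder */
    }

The fix is a release store in producer() paired with an acquire load in consumer(); the trap is that on a stricter-than-specified implementation the buggy version never visibly fails, so it ships.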
> It would actually be an interesting study to run native powerpc code under strong ordering mode vs weak,
> and examine the performance difference. Obviously it would just be a single data point, not evidence of
> general case, because different levels of optimization probably went into it (Intel probably spent far
> more effort to optimize processor ordering than IBM did!). But still it would be interesting, IMO.
>
One effect that contributes to the issue is that architects for weakly ordered MOMs (memory ordering models) tend not to devote hardware resources to making the corner cases fast.
> I'm not sure if I agree. It may be true that they tend to introduce more bugs, but really
> if they don't know what they are doing, they really should just use synchronization libraries
> and language features, and those should generally get the ordering details correct.
>
IIRC, Linus has commented on this topic in the past on this very board. Programmers tend to be more conservative with weakly ordered architectures; that is, they'll put in more barriers than may actually be needed, just in case, or because trying to figure out every case is far too complicated. So when in doubt, use a barrier.
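That conservatism shows up directly in the ordering annotations people pick. A minimal sketch (my illustration, not anything from Linus or actual kernel code): a publish that a careful audit would show only needs a release store, written with the strongest ordering just in case:

    #include <stdatomic.h>

    int payload;
    atomic_int flag;

    void publish(int v) {
        payload = v;
        /* memory_order_release is all this pattern needs, but the
         * burned-once programmer reaches for seq_cst "just in case".
         * On weakly ordered CPUs that typically costs a heavier
         * barrier (e.g. hwsync instead of lwsync on POWER). */
        atomic_store_explicit(&flag, 1, memory_order_seq_cst);
    }

The code is still correct, just slower than it needs to be on exactly the architectures where barriers already hurt the most.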
> I'm not saying that this always happens, but I've never heard of anybody programming
> portable code and getting worried about a weakly ordered CPU and then declining to attempt
> some optimization that they otherwise would have. People who don't understand this probably
> don't understand either that x86 can reorder loads ahead of stores too.
>
It generally manifests as more barriers in code around anything that could be a corner case, combined with weakly ordered CPUs generally handling barriers poorly. Strongly ordered CPUs effectively have barriers everywhere, so the hardware tends to handle them much better.
> > So far, the
> > performance/strictness relationship with MOMs has been a case of Theory != Practice.
>
> I'm really not sure that's the case, because it's very hard to know the details. We may not know whether
> Intel spends 1% additional power on a larger store queue because in-order retirement of stores effectively
> increases the amount of time stores have to remain. We don't know if speculative reordering of loads
> costs them a small additional amount of complexity or power due to failed speculation, etc.
>
> We (for most values of 'we' that comment on this topic in this forum) see the software
> side of it. Which *appears* like memory barriers are slow and annoying to use.
>
That's because barriers are generally slow and annoying to use, and weakly ordered architectures need more of them. Strongly ordered architectures need fewer barriers, and because ordering is effectively the common case for them, they handle it much better.
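You can see the asymmetry in what compilers emit for the same portable source. A small example (mine; the listed instruction sequences are the typical mappings, easy to verify on a compiler explorer):

    #include <stdatomic.h>

    void store_release(atomic_int *p, int v) {
        atomic_store_explicit(p, v, memory_order_release);
    }

    /* Typical code generation for the release store above:
     *   x86-64:  mov [rdi], esi        ; free, hardware is already strong
     *   AArch64: stlr w1, [x0]         ; dedicated release-store instruction
     *   POWER:   lwsync; stw r4,0(r3)  ; explicit barrier before the store
     * The strongly ordered target pays for ordering in hardware on every
     * store; the weak targets pay per barrier, and how well a given core
     * handles that barrier varies a lot between implementations. */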
> And for the case of low level synchronization details, programmers be damned (to some extent). The average
> app programmer will never even grasp x86 ordering rules anyway, so it's a lost cause. Have them use language/library
> features, and everyone is happy. The language and library programmers who are incapable of understanding
> POWER rules, probably aren't the correct people to be writing the x86 code either.
>
Except even the "good" programmers aren't that "good" at dealing with memory ordering models and synchronization. This has been true almost forever and will most likely remain true almost forever. It is a rather complex area with very, very few people who are truly proficient at it. Even on the hardware side, where the percentage of people who understand it all is much higher, we tend to rely quite heavily on formal models and formal proofs for this stuff because it is so complex. The state space for this in a modern coherent system is absolutely massive. Adding a layer of software on top basically makes it impossible to reason about informally, so everyone has to err on the side of caution.