By: anon (anon.delete@this.anon.com), August 17, 2014 2:40 am
Room: Moderated Discussions
Aaron Spink (aaronspink.delete@this.notearthlink.net) on August 17, 2014 12:42 am wrote:
> anon (anon.delete@this.anon.com) on August 17, 2014 12:10 am wrote:
> > I'd like to see actual evidence of this. Of course we've
> > all seen graphs showing the opposite -- that release
> > consistency (weak ordering with release barriers) is more performant than strong and processor ordering.
> >
> Most of the evidence is anecdotal in nature. Some of this is because processors that are strongly ordered
> tend to spend some extra hardware resources to increase performance (most x86), second because the programmers
> tend to be extra cautious on weakly ordered architectures because they've been burned too much.
>
> > On the other hand, those have tended to be on academic papers, and the most recent one I've seen
> > is quite old. So I don't put a lot of stock in those, for modern CPUs and real applications.
> >
> As I said, this tends to be one of those areas where theory and practice
> don't exactly mesh. Academic papers tend to be more theory.
Right, but anecdotes and practice are notoriously unreliable as well. Even from experts, with no offense intended to you or Linus.
>
>
> > That said, I've never seen any evidence (except anecdotal and not very convincing evidence)
> > of the opposite. Keep in mind that implementations can always choose to implement stronger ordering
> > than specified (I think I've read that many ARM implementations do this, for example). Possibly
> > that is done for compatibility and/or ease of hardware implementation. But if it were a performance
> > advantage, you would expect to see IBM going that way. Because they actually *do* support the option
> > for x86-like ordering in recent POWER CPUs, but they keep their weak ordering around.
> >
> Even if they implement stronger ordering than specified, it is just laying a minefield for the future. One
> example is the EV5 to EV6 transition, where EV5 was much more strict than the actual architecture, which resulted
> in numerous hard-to-track-down bugs when EV6 was released. EV6 was WaA but the software wasn't.
The point is that there is no minefield in going the *other* way. Maybe they create a minefield if they ever want to switch back, but if it is unquestionably better to go with stronger ordering, then they have no barrier to making the transition.
> > It would actually be an interesting study to run native powerpc code under strong ordering mode vs weak,
> > and examine the performance difference. Obviously it would just be a single data point, not evidence of
> > general case, because different levels of optimization probably went into it (Intel probably spent far
> > more effort to optimize processor ordering than IBM did!). But still it would be interesting, IMO.
> >
> One effect that contributes to the issue is that architects for weakly ordered
> MOMs tend to not devote resources to having the corner cases be fast.
Right, but (at least ideally), that's an opportunity cost. They have more resources to spend on having the regular cases be fast.
> > I'm not sure if I agree. It may be true that they tend to introduce more bugs, but really
> > if they don't know what they are doing, they really should just use synchronization libraries
> > and language features, and those should generally get the ordering details correct.
> >
> IIRC, Linus has commented on this topic in the past on this very board. They tend to be more conservative
> with weakly ordered architectures. AKA, they'll put in more barriers than possibly needed, just in case,
> or because trying to figure out every case is far too complicated. So when in doubt, use a barrier.
Not... really. I mean, in practice sometimes, but that should not really happen.
In core code (core kernel/compiler/library) doing that indicates that you do not understand what the data races are, and therefore your code is probably buggy *anyway*. Even if you hit it with a big barrier, there could easily be another race or synchronization bug lurking somewhere.
If peripheral (e.g., driver) or application code is doing that, then it indicates the synchronization APIs are inadequate.
I know Linux's driver APIs and potential memory ordering semantics were notoriously shit, which is why, indeed, you would get drivers failing to work on powerpc, or being utterly riddled with wmb() between every statement.
That could have largely been avoided with better APIs. Don't tell driver writers, "err, stores to cacheable and MMIO regions are unordered with respect to one another, so use barriers". Give them APIs and calls that match the steps of their process flow when setting up driver operations: begin a DMA transfer, release a lock protecting MMIO registers, and so on.
> > I'm not saying that this always happens, but I've never heard of anybody programming
> > portable code and getting worried about a weakly ordered CPU and then declining to attempt
> > some optimization that they otherwise would have. People who don't understand this probably
> > don't understand either that x86 can reorder loads ahead of stores too.
> >
> It manifests itself generally as more barriers in code around anything that could be a corner
> case combined with weakly ordered CPUs generally handling barriers poorly. Strongly ordered
> CPUs effectively have barriers everywhere so the hardware tends to handle them much better.
mfence is heavyweight on x86 too, just saying. If they don't understand what they are doing, and the spectre of the weakly-ordered bogeyman is haunting them into putting barriers everywhere, then they probably did the x86 port a favor too: they slowed it down in the process, but they got rid of some obscure bugs as well.
> > > So far, the
> > > performance/strictness relationship with MOMs has been a case of Theory != Practice.
> >
> > I'm really not sure that's the case, because it's very hard to know the details. We may not know whether
> > Intel spends 1% additional power on a larger store queue because in-order retirement of stores effectively
> > increases the amount of time stores have to remain. We don't know if speculative reordering of loads
> > costs them a small additional amount of complexity or power due to failed speculation, etc.
> >
> > We (for most values of 'we' that comment on this topic in this forum) see the software
> > side of it. Which *appears* like memory barriers are slow and annoying to use.
> >
> That's because barriers are generally slow and annoying to use and weakly ordered
> architectures need more of them.
And *that's* because you did not see the cost earlier in the code flow, because the ordering was relaxed. Intel's store buffer filled up and stalled on a remotely held cacheline 200 instructions ago, so you did not see the cost of the fence.
> Strongly ordered architectures need fewer barriers
> and because it's effectively a common case, handle it much better.
>
>
> > And for the case of low level synchronization details, programmers be damned (to some extent). The average
> > app programmer will never even grasp x86 ordering rules anyway,
> > so it's a lost cause. Have them use language/library
> > features, and everyone is happy. The language and library programmers who are incapable of understanding
> > power rules, probably aren't the correct people to be writing the x86 code either.
> >
> Except even the "good" programmers aren't that "good" at dealing with memory order models and sync.
Right. So they should use libraries and languages. Seriously. Everyone goes on about making the programmer's life easier; the way to make it easier is to have them use those things instead of being cowboys and doing retarded shit. That actually applies even to good programmers who do understand such things, because bugs are inevitable, so they should not venture outside of synchronization libraries until they need to. And there is very little need to, until much more fundamental problems are solved first, such as how to split and recombine work between CPUs. Even then, the availability of language and library primitives to do the job has increased to the point that, really, almost nobody should ever need or want to access shared memory without going through such features. Just don't do it.
The same goes for x86 too, by the way. Programmers should use synchronization libraries *everywhere*, including ones who know memory consistency quite well, because they *will* get subtle bugs.
> This has been true for almost forever and will most likely remain true for almost ever. It is
> a rather complex area with very very few people who are truly proficient at it. Even on the hardware
> side where the percentage of people who understand it all is much higher, we tend to rely quite
> heavily on formal models and formal proofs for this stuff cause it is so complex. The state space
> for this in a modern coherent system is absolutely massive. Adding a layer of software on top
> basically makes it impossible, so everyone has to err on the side of caution.
Which means going through synchronization APIs.