By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), July 12, 2015 1:07 pm
Room: Moderated Discussions
EduardoS (no.delete@this.spam.com) on July 12, 2015 11:53 am wrote:
> >
> > See? No memory barriers. No nothing. Just that same model of "load early and mark".
>
> As a programmer I think it is easy too, but nobody does it, not even x86.
Actually, from what I can tell, that's pretty much exactly what modern big-core Intel CPUs are doing. It's just that write buffering is explicitly visible to software, and thus reordering reads vs earlier local writes happens without that whole extra checking.
Basically, delaying a store isn't considered "speculatively reordering" it. So no, x86 is not sequentially consistent, because of the delayed stores.
Some people call it "x86-TSO".
So the example I gave was designed to illustrate the point about re-ordering and how barriers make sense - and how they don't. Not so much designed to show what x86 does.
The normal x86 model lets loads go ahead of local stores without any extra work (or put another way: the stores can be buffered over later loads). And that's actually the only really common case that you really absolutely need to re-order very aggressively for good performance.
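To make "stores can be buffered over later loads" concrete, here is a toy sketch of the classic store-buffering test: CPU 0 does "x = 1; r0 = y" and CPU 1 does "y = 1; r1 = x". It is only a model, not a description of real hardware; the names `PROGRAMS` and `explore` are made up for illustration. It enumerates every interleaving, once with stores going straight to memory and once with a per-CPU FIFO store buffer (with forwarding from the CPU's own buffer).

```python
# Toy model: hypothetical names, not any real CPU's implementation.
PROGRAMS = {
    0: [("store", "x", 1), ("load", "y")],
    1: [("store", "y", 1), ("load", "x")],
}

def explore(buffered):
    """Return every (r0, r1) result the two CPUs can end up with."""
    results = set()

    def run(mem, pcs, bufs, regs):
        moved = False
        for cpu in (0, 1):
            # Choice 1: the oldest buffered store drains to memory (FIFO).
            if bufs[cpu]:
                addr, val = bufs[cpu][0]
                rest = bufs[cpu][1:]
                new_bufs = bufs[:cpu] + (rest,) + bufs[cpu + 1:]
                run({**mem, addr: val}, pcs, new_bufs, regs)
                moved = True
            # Choice 2: the CPU executes its next instruction in program order.
            if pcs[cpu] < len(PROGRAMS[cpu]):
                op = PROGRAMS[cpu][pcs[cpu]]
                new_pcs = pcs[:cpu] + (pcs[cpu] + 1,) + pcs[cpu + 1:]
                if op[0] == "store":
                    _, addr, val = op
                    if buffered:
                        # The store sits in the local buffer; later loads may pass it.
                        grown = bufs[cpu] + ((addr, val),)
                        new_bufs = bufs[:cpu] + (grown,) + bufs[cpu + 1:]
                        run(mem, new_pcs, new_bufs, regs)
                    else:
                        run({**mem, addr: val}, new_pcs, bufs, regs)
                else:
                    _, addr = op
                    # A load sees the CPU's own newest buffered store first, else memory.
                    pending = [v for a, v in bufs[cpu] if a == addr]
                    val = pending[-1] if pending else mem[addr]
                    new_regs = regs[:cpu] + (val,) + regs[cpu + 1:]
                    run(mem, new_pcs, bufs, new_regs)
                moved = True
        if not moved:  # both programs finished and both buffers are empty
            results.add(regs)

    run({"x": 0, "y": 0}, (0, 0), ((), ()), (None, None))
    return results
```

In this model `explore(False)` never produces (0, 0), while `explore(True)` produces all four combinations: both loads reading 0 is exactly the one extra result that the store buffer makes visible.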
So x86 cheats and doesn't do full sequential ordering. It does what it does for the usual historical reasons: there was always a store buffer, so the whole "stores can be delayed" has been there since day one. Even back when there were no caches at all, the store buffer still meant that stores would be delayed, and that was "visible" both for DMA and for SMP (yes, some people - notably Sequent - did SMP on 80386s with no caches).
So the x86 "delayed but in-order stores" is not some kind of smart design by superhuman minds that knew it was a good idea, it's a historical accident like pretty much all tech issues are. Then when caches got added, so as not to break anything (since there were applications that would break, even back then), everything else was done fairly strictly ordered. Which doesn't really say much, since the cores were in-order, the caches weren't horribly aggressive (I think they were originally blocking), and it was just all on a common frontside bus anyway. So there was a lot of inherent ordering there.
With the P6 and OoO you had the first situation where Intel really almost could get some ordering issues, and they did in fact have a few SMP bugs.
But when pretty much everybody else said "we'll re-order anything against anything" (i.e. weak ordering) because they thought they were clever and had learnt from other people's mistakes, x86 for historical reasons basically said "we'll only buffer writes".
And it turns out that write buffering is like 90% of all the re-ordering you need for performance anyway, and is one of the simpler concepts for software (and programmers' tiny little minds) to handle, so it's a fairly reasonable engineering trade-off. So the clever people who wanted to reorder more aggressively turned out to not be that clever after all.
Because in contrast to delaying stores, things like letting stores go early before earlier loads make no sense at all, and re-ordering stores against other stores is pretty dubious too. It just doesn't help that much. Load-vs-load reordering gets you some performance, but as I tried to outline, you can get that without requiring barriers, and it gets more complex for software.
So the things that weak memory ordering allows (over TSO) aren't actually all that helpful, and they do hurt software.
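A toy sketch of why store-vs-store reordering hurts software, using the usual message-passing pattern: a writer stores `data` and then `flag`, and a reader loads `flag` and then `data`. This is an illustrative model with made-up names (`WRITER`, `READER`, `reader_views`), not a description of any particular CPU. With in-order (FIFO) draining, a reader that sees the flag always sees the data; let the buffer drain in any order and that guarantee is gone unless the software adds a barrier.

```python
# Toy model: hypothetical names, not any real CPU's implementation.
WRITER = [("store", "data", 1), ("store", "flag", 1)]
READER = [("load", "flag"), ("load", "data")]

def reader_views(fifo):
    """Return every (flag, data) pair the reader can observe."""
    seen = set()

    def run(mem, buf, wpc, loads):
        moved = False
        # Buffered stores drain to memory: only the oldest one when fifo is
        # True, any entry when store-vs-store reordering is allowed.
        # (Simplification: this example never buffers two stores to one address.)
        drainable = range(min(1, len(buf))) if fifo else range(len(buf))
        for i in drainable:
            addr, val = buf[i]
            run({**mem, addr: val}, buf[:i] + buf[i + 1:], wpc, loads)
            moved = True
        # The writer issues its next store into its store buffer.
        if wpc < len(WRITER):
            _, addr, val = WRITER[wpc]
            run(mem, buf + ((addr, val),), wpc + 1, loads)
            moved = True
        # The reader performs its next load from memory, in program order.
        if len(loads) < len(READER):
            _, addr = READER[len(loads)]
            run(mem, buf, wpc, loads + (mem[addr],))
            moved = True
        if not moved:  # writer done, buffer empty, reader done
            seen.add(loads)

    run({"data": 0, "flag": 0}, (), 0, ())
    return seen
```

In this model `reader_views(True)` never contains (1, 0), i.e. "flag set but data stale", while `reader_views(False)` does.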
So the "weak memory ordering" people made a mistake. They wanted a "clean" architecture and not those ugly arbitrary rules where only one kind of re-ordering is done. This is where the Alpha really shone - the cleanest of them all, and the most broken of them all. That was basically one big "fuck you" to sanity, saying that if you allow one kind of reordering, you should damn well allow anything at all. Never mind that that "one kind" was the one really sane and important one, and the other kinds of reordering are really painful and very questionable.
So x86 still does have visible reordering. SPARC has pretty much the same thing (interestingly, SPARC had multiple memory models, but I don't think anybody ever used anything but TSO, possibly exactly because they could compare the effects of switching memory models, and see that TSO worked best). But x86 lacks the crazy re-ordering.
(And then the "let's re-order but remember and check" model allows x86 to re-order when it wants to, and the transactional instructions then make that explicit and expose it to users)
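A rough sketch of the "re-order but remember and check" idea, under loudly stated assumptions: the reader's program order is "load flag, then load data", the hardware hoists the data load early (step E), marks that line, and at retirement (step R) replays the load if another CPU's store hit the marked line in between. The step names and the `views` function are hypothetical, and real hardware does this with cache-line snooping rather than anything like this loop.

```python
from itertools import permutations

def views(check):
    """All (flag, data) results when the data load is hoisted above the flag load."""
    results = set()
    # W1, W2: the writer's stores to data and flag reach memory in that order.
    # E: early (speculative) load of data. F: load of flag. R: data load retires.
    for order in permutations(["W1", "W2", "E", "F", "R"]):
        if order.index("W1") > order.index("W2"):
            continue  # the writer's stores stay in order
        if not (order.index("E") < order.index("F") < order.index("R")):
            continue  # the reader's steps happen in this fixed sequence
        mem = {"data": 0, "flag": 0}
        early = flag = data = None
        mark_valid = False
        for step in order:
            if step == "W1":
                mem["data"] = 1
                mark_valid = False  # the store hits the marked line
            elif step == "W2":
                mem["flag"] = 1
            elif step == "E":
                early = mem["data"]
                mark_valid = True   # remember that this load went early
            elif step == "F":
                flag = mem["flag"]
            else:  # R: keep the early value only if nothing touched the line
                data = early if (mark_valid or not check) else mem["data"]
        results.add((flag, data))
    return results
```

With the check enabled, `views(True)` contains only the results an in-order reader could see; with the check disabled, `views(False)` also contains (1, 0), which is the visible load-vs-load reordering that weakly ordered software has to fence against.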
Linus