By: anon (anon.delete@this.anon.com), July 12, 2015 6:52 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on July 12, 2015 1:07 pm wrote:
> EduardoS (no.delete@this.spam.com) on July 12, 2015 11:53 am wrote:
> > >
> > > See? No memory barriers. No nothing. Just that same model of "load early and mark".
> >
> > As a programmer I think it is easy too, but nobody does it, not even x86.
>
> Actually, from what I can tell, that's pretty much exactly what modern big-core Intel CPU's
> are doing, it's just that write buffering is explicitly visible to software and thus reordering
> reads vs earlier local writes happens without that whole extra checking.
>
> Basically, delaying a store isn't considered "speculatively reordering" it.
> So no, x86 is not sequentially consistent, because of the delayed stores.
>
> Some people call it "x86-TSO".
>
> So the example I gave was designed to illustrate the point about re-ordering and how barriers
> make sense - and how they don't. Not so much designed to show what x86 does.
>
> The normal x86 model lets loads go ahead of local stores without any extra work (or put another
> way: the stores can be buffered over later loads). And that's actually the only really common
> case that you really absolutely need to re-order very aggressively for good performance.
>
> So x86 cheats and doesn't do full sequential ordering. It does what it does for the usual historical reasons:
> there was always a store buffer, so the whole "stores can be delayed" has been there since day one. Even back
> when there were no caches at all, the store buffer still meant that stores would be delayed, and that was "visible"
> both for DMA and for SMP (yes, some people - notably Sequent - did SMP on 80386s with no caches).
>
> So the x86 "delayed but in-order stores" is not some kind of smart design by superhuman minds that knew
> it was a good idea, it's a historical accident like pretty much all tech issues are. Then when caches
> got added, so as not to break anything (since there were applications that would break, even back then),
> everything else was done fairly strictly ordered. Which doesn't really say much, since the cores were
> in-order, the caches weren't horribly aggressive (I think they were originally blocking), and it was
> just all on a common frontside bus anyway. So there was a lot of inherent ordering there.
>
> With the P6 and OoO you had the first situation where Intel really almost could
> get some ordering issues, and they did in fact have a few SMP bugs.
>
> But, when pretty much everybody else says "we'll re-order anything against anything" (ie
> weak ordering) because they thought they were clever and had learnt from other peoples
> mistakes, x86 for historical reasons basically said "we'll only buffer writes".
Interestingly, Intel was for a long time hesitant to commit to its current memory consistency model by formalizing it in the ISA documents. IIRC that finally happened around the Core 2(?) timeframe.

Presumably their cores had never implemented anything weaker, but there must have been questions about whether guaranteeing the stronger consistency would end up costing significant performance. My guess is that they implemented memory disambiguation and load/load speculation around the Core 2 timeframe and decided they could commit to strong load/load and store/store ordering because it would not hamper performance significantly. (All of this is pure conjecture on my part -- other explanations are possible: maybe they found significant legacy code breakage, or concluded the guarantee might hurt them a bit but would hurt competitors more, etc.)