By: dmcq (dmcq.delete@this.fano.co.uk), July 15, 2015 3:51 am
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on July 12, 2015 7:42 pm wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on July 12, 2015 11:24 am wrote:
> > anon (anon.delete@this.anon.com) on July 12, 2015 3:42 am wrote:
> > >
> > > So all of that happens inside the core. Memory ordering instructions
> > > have to prevent these reorderings within the core.
> >
> > No, they really don't.
>
> Obviously:
> * I was responding to the claim about hardware in general. Not a particular implementation.
> * I meant prevent the *appearance* of those reorderings.
> * I acknowledged speculative approaches that work to prevent apparent reordering.
>
> I appreciate the time you took to respond though.
>
> > So just look at that example of "do a load early" model: just do the load early, you marked
> > it somewhere in the memory subsystem, and you added it to your memory access retirement queue.
> > Now you just need to figure out if anybody did a store that invalidated the load.
> >
> > And guess what? That's not so hard. If you did an early load, that means that you had to get the cacheline
> > with the load data. Now, how do you figure out whether another store disturbed that data? Sure, you
> > still have the same store buffer logic that you used for UP for the local stores, but you also see
> > the remote stores: they'd have to get the cacheline from you. So all your "marker in the memory subsystem"
> > has to react to is that the cacheline it marked went away (and maybe the cacheline comes back, but
> > that doesn't help - if it went away, it causes the marker to be "invalid").
> >
> > See? No memory barriers. No nothing. Just that same model of "load early and mark".
>
> This doesn't invalidate my comment as I explained above, but I'd
> like to respond to it because this topic is of interest to me.
>
> You're talking about memory operations to a single address, and as such, it has nothing
> to do with the memory ordering problem. Fine, speculatively load an address and check
> for invalidation, but that does not help the memory ordering problem.
>
> The problem (and the reason why x86 explicitly allows and actually does reorder in practice) is load vs older
> store. You have moved your load before the store executes, and you have all the mechanism in place to ensure
> that load is valid. How do you determine if *another* CPU has loaded the location of your store within that
> reordered window? Eh? And no, you can't check that the store cacheline remains exclusive on your CPU because
> you may not even own it exclusive let alone know the address of it at the time you performed your load.
>
> Example, all memory set to 0:
>
> CPU1          CPU2
> 1 -> [x]      1 -> [y]
> r0 <- [y]     r1 <- [x]
>
> Both loads can be moved ahead of the older stores, so r0 == 0 and r1 == 0 is a possible outcome.
>
> > And the important take-away from this is two-fold:
> >
> > (1) notice how the smarter CPU core that didn't need the memory barriers didn't re-order
> > memory operations less. No sirree! It's the smarter core, and it actually re-orders
> > memory operations more aggressively than the stupid core that needed memory barriers.
> >
> > In fact, please realize that the memory barrier model performs fewer re-orderings even when
> > there are no memory barriers present - but it particularly sucks when you actually use memory
> > barriers. The smart CPU "just works" and continues to re-order operations even when you need
> > strict ordering - because in the (very unlikely) situation that you actually get a conflict,
> > it will re-do things. The stupid memory barrier model will actually slow down (often to a crawl),
> > because the memory barriers will limit its already limited re-ordering much further.
> >
> > (2) weak memory ordering is an artifact of historical CPU design. It made sense in the
> > exact same way that RISC made sense: it's the same notion of "do as well as you can, with
> > the limitations of the time in place, don't waste any effort on anything 'clever'".
> >
> > So you have that odd "dip" in the middle where weak memory ordering makes sense. In really
> > old designs, you didn't have caches, and CPU's were in-order, so memory barriers and weak
> > memory ordering were a non-issue. And once you get to a certain point of complexity in your
> > memory pipe, weak ordering again makes no sense because you got sufficiently advanced tools
> > to handle re-ordering that synchronizing things by serializing accesses is just stupid.
> >
> > The good news is that a weak memory model can be strengthened. IBM could
> > - if they chose to - just say "ok, starting with POWER9, all of those stupid
> > barriers are no-ops, because we just do things right in the core".
> >
> > The bad news is that the weak memory model people usually have their mental model clouded by
> > their memory barriers, and they continue to claim that it improves performance, despite that
> > clearly not being the case. It's the same way we saw people argue for in-order cores not that
> > many years ago (with BS like "they are more power-efficient". No, they were just simple and
> > stupid, and performed badly enough to negate any power advantage many times over).
>
> I think you give less credit than they deserve. POWER designers for example would surely be looking
> at x86 performance and determining where they can improve. Their own mainframe designers actually
> implement x86-like ordering presumably with relatively good performance. Not that they are necessarily
> all the same people working on both lines, but at least you know patents would not get in the way
> of borrowing ideas there. They've been following somewhat similar paths as Intel designers have in
> this regard, reducing cost of barriers, implementing store address speculation, etc.
>
> In the case of ARM, I would say there is zero chance they did not re-examine the memory ordering model when
> defining the 64-bit ISA, with *data* (at least from simulations) rather than old wives' tales, and they would have
> taken input from Apple and their own designers (AFAIK Cortex cores do not do a lot of reordering anyway).
>
> And there is zero chance that any of the designers involved in any modern OOOE processor are unaware of
> any of the speculative techniques that you or I know of, and they probably know a few that we don't, too.
>
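(As an aside, to make the store-buffer example quoted above concrete, here is a rough C11 sketch of that test - my own illustration, not anything from the thread; the names x, y, r0 and r1 just follow the quoted example.)

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x, y;        /* both start at 0, as in the example */
static int r0, r1;

static void *cpu1(void *arg)
{
    /* 1 -> [x]; r0 <- [y].  Relaxed, so the load may pass the older store. */
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *cpu2(void *arg)
{
    /* 1 -> [y]; r1 <- [x]. */
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, cpu1, NULL);
    pthread_create(&t2, NULL, cpu2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* r0 == 0 && r1 == 0 is a legal outcome here: each store can sit in
       its CPU's store buffer while the load runs ahead of it.  Making the
       four accesses memory_order_seq_cst (or putting a full fence between
       each store and the following load) forbids that outcome. */
    printf("r0=%d r1=%d\n", r0, r1);
    return 0;
}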
I think that is true of x86, but it may not be for others. If a core keeps writing to the same location in between other writes, that write may be delayed by the store queue coalescing stores, and yet other writes may pass it. In that case the writer should do a release after it has written some logical unit, and the reader should do an acquire before it starts reading. Basically, on those architectures release-acquire is the way to ensure that the data a reader sees is what the writer wrote.
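A minimal sketch of that release/acquire pattern with C11 atomics (again just my own illustration; "payload" and "ready" are made-up names):

#include <stdatomic.h>
#include <stdbool.h>

static int payload[4];          /* the "logical unit" being written        */
static atomic_bool ready;       /* publication flag, starts false          */

void writer(void)
{
    payload[0] = 1;             /* plain stores; the store queue is free   */
    payload[1] = 2;             /* to coalesce or reorder these among      */
    payload[2] = 3;             /* themselves                              */
    payload[3] = 4;
    /* Release: none of the stores above may appear to pass this one. */
    atomic_store_explicit(&ready, true, memory_order_release);
}

int reader(void)
{
    /* Acquire: if we see ready == true, everything stored before the
       matching release store is guaranteed to be visible here too. */
    if (atomic_load_explicit(&ready, memory_order_acquire))
        return payload[0] + payload[1] + payload[2] + payload[3];
    return -1;                  /* not published yet */
}

Note that release/acquire only orders the writer's earlier stores ahead of the flag and the reader's later loads after it; it does not forbid the store-then-load reordering in the litmus test above, which needs a full barrier or sequentially consistent accesses.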