By: dmcq (dmcq.delete@this.fano.co.uk), July 15, 2015 6:21 am
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on July 15, 2015 4:38 am wrote:
> dmcq (dmcq.delete@this.fano.co.uk) on July 15, 2015 3:51 am wrote:
> > anon (anon.delete@this.anon.com) on July 12, 2015 7:42 pm wrote:
> > > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on July 12, 2015 11:24 am wrote:
> > > > anon (anon.delete@this.anon.com) on July 12, 2015 3:42 am wrote:
> > > > >
> > > > > So all of that happens inside the core. Memory ordering instructions
> > > > > have to prevent these reorderings within the core.
> > > >
> > > > No, they really don't.
> > >
> > > Obviously:
> > > * I was responding to the claim about hardware in general. Not a particular implementation.
> > > * I meant prevent the *appearance* of those reorderings.
> > > * I acknowledged speculative approaches that work to prevent apparent reordering.
> > >
> > > I appreciate the time you took to respond though.
> > >
> > > > So just look at that example of "do a load early" model: just do the load early, you marked
> > > > it somewhere in the memory subsystem, and you added it to your memory access retirement queue.
> > > > Now you just need to figure out if anybody did a store that invalidated the load.
> > > >
> > > > And guess what? That's not so hard. If you did an early load, that means that you had to get the cacheline
> > > > with the load data. Now, how do you figure out whether another store disturbed that data? Sure, you
> > > > still have the same store buffer logic that you used for UP for the local stores, but you also see
> > > > the remote stores: they'd have to get the cacheline from you. So all your "marker in the memory subsystem"
> > > > has to react to is that the cacheline it marked went away (and maybe the cacheline comes back, but
> > > > that doesn't help - if it went away, it causes the marker to be "invalid").
> > > >
> > > > See? No memory barriers. No nothing. Just that same model of "load early and mark".
> > >
> > > This doesn't invalidate my comment as I explained above, but I'd
> > > like to respond to it because this topic is of interest to me.
> > >
> > > You're talking about memory operations to a single address, and as such, it has nothing
> > > to do with the memory ordering problem. Fine, speculatively load an address and check
> > > for invalidation, but that does not help the memory ordering problem.
> > >
> > > The problem (and the reason why x86 explicitly allows and
> > > actually does reorder in practice) is load vs older
> > > store. You have moved your load before the store executes,
> > > and you have all the mechanism in place to ensure
> > > that load is valid. How do you determine if *another* CPU has loaded the location of your store within that
> > > reordered window? Eh? And no, you can't check that the
> > > store cacheline remains exclusive on your CPU because
> > > you may not even own it exclusive let alone know the address of it at the time you performed your load.
> > >
> > > Example, all memory set to 0:
> > >
> > > CPU1          CPU2
> > > 1 -> [x]      1 -> [y]
> > > [y] -> r0     [x] -> r1
> > >
> > > > And the important take-away from this is two-fold:
> > > >
> > > > (1) notice how the smarter CPU core that didn't need the memory barriers didn't re-order
> > > > memory operations less. No sirree! It's the smarter core, and it actually re-orders
> > > > memory operations more aggressively than the stupid core that needed memory barriers.
> > > >
> > > > In fact, please realize that the memory barrier model performs fewer re-orderings even when
> > > > there are no memory barriers present - but it particularly sucks when you actually use memory
> > > > barriers. The smart CPU "just works" and continues to re-order operations even when you need
> > > > strict ordering - because in the (very unlikely) situation that you actually get a conflict,
> > > > it will re-do things. The stupid memory barrier model will actually slow down (often to a crawl),
> > > > because the memory barriers will limit its already limited re-ordering much further.
> > > >
> > > > (2) weak memory ordering is an artifact of historical CPU design. It made sense in the
> > > > exact same way that RISC made sense: it's the same notion of "do as well as you can, with
> > > > the limitations of the time in place, don't waste any effort on anything 'clever'".
> > > >
> > > > So you have that odd "dip" in the middle where weak memory ordering makes sense. In really
> > > > old designs, you didn't have caches, and CPU's were in-order, so memory barriers and weak
> > > > memory ordering was a non-issue. And once you get to a certain point of complexity in your
> > > > memory pipe, weak ordering again makes no sense because you got sufficiently advanced tools
> > > > to handle re-ordering that synchronizing things by serializing accesses is just stupid.
> > > >
> > > > The good news is that a weak memory model can be strengthened. IBM could
> > > > - if they chose to - just say "ok, starting with POWER9, all of those stupid
> > > > barriers are no-ops, because we just do things right in the core".
> > > >
> > > > The bad news is that the weak memory model people usually have their mental model clouded by
> > > > their memory barriers, and they continue to claim that it improves performance, despite that
> > > > clearly not being the case. It's the same way we saw people argue for in-order cores not that
> > > > many years ago (with BS like "they are more power-efficient". No, they were just simple and
> > > > stupid, and performed badly enough to negate any power advantage many times over).
> > >
> > > I think you give them less credit than they deserve. POWER designers, for example, would surely be looking
> > > at x86 performance and determining where they can improve. Their own mainframe designers actually
> > > implement x86-like ordering, presumably with relatively good performance. Not that they are necessarily
> > > all the same people working on both lines, but at least you know patents would not get in the way
> > > of borrowing ideas there. They've been following somewhat similar paths as Intel designers have in
> > > this regard, reducing cost of barriers, implementing store address speculation, etc.
> > >
> > > In the case of ARM, I would say there is zero chance they did not re-examine the memory ordering model when
> > > defining the 64-bit ISA, with *data* (at least from simulations)
> > > rather than wives' tales, and they would have
> > > taken input from Apple and their own designers (AFAIK Cortex cores do not do a lot of reordering anyway).
> > >
> > > And there is zero chance that any of the designers involved in any modern OOOE processor are unaware of
> > > any of the speculative techniques that you or I know of, and they probably know a few that we don't too.
> > >
> > I think that's true of x86 but it may not be for others.
>
> I'm honestly not sure what you're replying to here. Maybe replied to the wrong post. Even so...
>
> > If one keeps writing to the same place
> > in between other writes, then that write may be delayed by the store queue doing coalescing and
> > yet other writes may pass it. In that case the writer should do a release after it has written
> > some logical unit, and the reader should do an acquire before it starts reading. Basically, on those
> > architectures release-acquire is the way to ensure the data read by a reader is what the writer wrote.
> >
>
> I don't know what you mean. Except in the special case of loads forwarded from the store queue,
> memory ordering is about operations to multiple memory locations. Writes to any single
> location will always have a well-defined order (again, not the case for the store queue, where
> you have "processor local" orderings and a single "cache coherent" ordering).
>
> So if we're talking about one CPU that is a writer and another that is a reader, it makes no sense to talk
> about memory barriers unless you specifically have memory operations before *and* after the barrier. What
> barriers do is prevent one *of the local CPU's* operations from becoming visible before another.
>
> If you have this example:
>
> CPU1
> store.release r1, [letterbox]
>
> CPU2
> load.acquire [letterbox], r2
>
> Then it tells you nothing about what those barriers are enforcing ordering against. In this
> example there is nothing that ensures CPU2 has seen CPU1's store if they run concurrently.
>
I was referring to the bit above about ordering being relevant to multiple loads and stores. The only guarantee I believe designers put in is that the load-acquire on CPU2 will eventually see the store-release by CPU1 if you wait long enough.
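
To make that concrete, here are two rough sketches in C11 atomics. They only illustrate the behaviour being discussed, not any particular implementation, and the variable names, the value 42 and the iteration count are mine.

First, the store-then-load example quoted above. Each thread stores 1 to its own flag and then loads the other one; with relaxed atomics, r0 == 0 && r1 == 0 is a permitted (and on real hardware, including x86, an observable) outcome, because the load can be satisfied before the older store becomes globally visible. Making all four accesses memory_order_seq_cst forbids that outcome.

/* build with: cc -std=c11 -pthread litmus.c */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int x, y;      /* both start at 0 */
static int r0, r1;

static void *cpu1(void *arg) {
    atomic_store_explicit(&x, 1, memory_order_relaxed);   /* 1 -> [x]  */
    r0 = atomic_load_explicit(&y, memory_order_relaxed);  /* [y] -> r0 */
    return NULL;
}

static void *cpu2(void *arg) {
    atomic_store_explicit(&y, 1, memory_order_relaxed);   /* 1 -> [y]  */
    r1 = atomic_load_explicit(&x, memory_order_relaxed);  /* [x] -> r1 */
    return NULL;
}

int main(void) {
    for (int i = 0; i < 100000; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_t t1, t2;
        pthread_create(&t1, NULL, cpu1, NULL);
        pthread_create(&t2, NULL, cpu2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        if (r0 == 0 && r1 == 0)
            printf("both loads saw 0 on iteration %d\n", i);
    }
    return 0;
}

And the letterbox example in the same style. The release store only orders the writer's earlier plain write before the flag, and the acquire load only orders the reader's later read after the flag; nothing makes the reader wait for the writer. If the flag still reads 0 the reader simply has not seen the store yet, and the only promise is that it will see it eventually.

#include <stdatomic.h>

static int payload;            /* data written before the flag */
static atomic_int letterbox;   /* the flag itself */

void writer(void)              /* "CPU1": store.release r1, [letterbox] */
{
    payload = 42;
    atomic_store_explicit(&letterbox, 1, memory_order_release);
}

int reader(void)               /* "CPU2": load.acquire [letterbox], r2 */
{
    if (atomic_load_explicit(&letterbox, memory_order_acquire) == 1)
        return payload;        /* if the flag was seen, the 42 is guaranteed visible */
    return -1;                 /* flag not visible yet; caller would retry */
}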