By: anon (anon.delete@this.anon.com), July 13, 2015 6:05 am
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on July 12, 2015 10:34 pm wrote:
> anon (anon.delete@this.anon.com) on July 12, 2015 7:42 pm wrote:
> > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on July 12, 2015 11:24 am wrote:
> > > anon (anon.delete@this.anon.com) on July 12, 2015 3:42 am wrote:
> > > >
> > > > So all of that happens inside the core. Memory ordering instructions
> > > > have to prevent these reorderings within the core.
> > >
> > > No, they really don't.
> >
> > Obviously:
> > * I was responding to the claim about hardware in general. Not a particular implementation.
> > * I meant prevent the *appearance* of those reorderings.
> > * I acknowledged speculative approaches that work to prevent apparent reordering.
> >
> > I appreciate the time you took to respond though.
> >
> > > So just look at that example of "do a load early" model: just do the load early, you marked
> > > it somewhere in the memory subsystem, and you added it to your memory access retirement queue.
> > > Now you just need to figure out if anybody did a store that invalidated the load.
> > >
> > > And guess what? That's not so hard. If you did an early load, that means that you had to get the cacheline
> > > with the load data. Now, how do you figure out whether another store disturbed that data? Sure, you
> > > still have the same store buffer logic that you used for UP for the local stores, but you also see
> > > the remote stores: they'd have to get the cacheline from you. So all your "marker in the memory subsystem"
> > > has to react to is that the cacheline it marked went away (and maybe the cacheline comes back, but
> > > that doesn't help - if it went away, it causes the marker to be "invalid").
> > >
> > > See? No memory barriers. No nothing. Just that same model of "load early and mark".
> >
> > This doesn't invalidate my comment as I explained above, but I'd
> > like to respond to it because this topic is of interest to me.
> >
> > You're talking about memory operations to a single address, and as such, it has nothing
> > to do with the memory ordering problem. Fine, speculatively load an address and check
> > for invalidation, but that does not help the memory ordering problem.
> >
> > The problem (and the reason why x86 explicitly allows and
> > actually does reorder in practice) is load vs older
> > store. You have moved your load before the store executes,
> > and you have all the mechanism in place to ensure
> > that load is valid. How do you determine if *another* CPU has loaded the location of your store within that
> > reordered window? Eh? And no, you can't check that the
> > store cacheline remains exclusive on your CPU because
> > you may not even own it exclusive let alone know the address of it at the time you performed your load.
> >
> > Example, all memory set to 0:
> >
> > CPU1           CPU2
> > 1 -> [x]       1 -> [y]
> > [y] -> r0      [x] -> r1
> >
> > Both r0 and r1 can end up 0, because each CPU's load can complete before
> > its older store becomes visible to the other CPU.
> >
> > > And the important take-away from this is two-fold:
> > >
> > > (1) notice how the smarter CPU core that didn't need the memory barriers didn't re-order
> > > memory operations less. No sirree! It's the smarter core, and it actually re-orders
> > > memory operations more aggressively than the stupid core that needed memory barriers.
> > >
> > > In fact, please realize that the memory barrier model performs fewer re-orderings even when
> > > there are no memory barriers present - but it particularly sucks when you actually use memory
> > > barriers. The smart CPU "just works" and continues to re-order operations even when you need
> > > strict ordering - because in the (very unlikely) situation that you actually get a conflict,
> > > it will re-do things. The stupid memory barrier model will actually slow down (often to a crawl),
> > > because the memory barriers will limit its already limited re-ordering much further.
> > >
> > > (2) weak memory ordering is an artifact of historical CPU design. It made sense in the
> > > exact same way that RISC made sense: it's the same notion of "do as well as you can, with
> > > the limitations of the time in place, don't waste any effort on anything 'clever'".
> > >
> > > So you have that odd "dip" in the middle where weak memory ordering makes sense. In really
> > > old designs, you didn't have caches, and CPU's were in-order, so memory barriers and weak
> > > memory ordering was a non-issue. And once you get to a certain point of complexity in your
> > > memory pipe, weak ordering again makes no sense because you got sufficiently advanced tools
> > > to handle re-ordering that synchronizing things by serializing accesses is just stupid.
> > >
> > > The good news is that a weak memory model can be strengthened. IBM could
> > > - if they chose to - just say "ok, starting with POWER9, all of those stupid
> > > barriers are no-ops, because we just do things right in the core".
> > >
> > > The bad news is that the weak memory model people usually have their mental model clouded by
> > > their memory barriers, and they continue to claim that it improves performance, despite that
> > > clearly not being the case. It's the same way we saw people argue for in-order cores not that
> > > many years ago (with BS like "they are more power-efficient". No, they were just simple and
> > > stupid, and performed badly enough to negate any power advantage many times over).
> >
> > I think you give them less credit than they deserve. POWER designers, for example, would surely be looking
> > at x86 performance and determining where they can improve. Their own mainframe designers actually
> > implement x86-like ordering presumably with relatively good performance. Not that they are necessarily
> > all the same people working on both lines, but at least you know patents would not get in the way
> > of borrowing ideas there. They've been following somewhat similar paths as Intel designers have in
> > this regard, reducing cost of barriers, implementing store address speculation, etc.
>
> If you look at IBM's zSeries, it's actually quite different than x86, due to IBM's emphasis on reliability.
> x86 has write-back L1 caches where the reliability is derived from the robust memory cells (8T design).
>
> IBM zArch uses write-through caching for *all* SRAM-based caches (on some
> designs that meant L1, L2, and L3 were all write-through, and only L4 was
> write-back; on more recent ones, I think only L1 & L2 are write-through).
>
> This creates a huge amount of pressure on the L2 and L3 caches to handle the full store bandwidth of
> the machine. Look at Bulldozer for an example of what happens when that doesn't quite work out.
Sure, but I was referring specifically to memory ordering, which is more similar to x86 than it is to POWER (at least for cacheable accesses; not sure about MMIO). At the "big ideas" level, the ways to implement high-performance OOOE and the ability to move loads early should share a lot of overlap with the solution space for x86 cores too.
You're right though: in practice this doesn't necessarily mean the techniques the mainframes use will look anything like what x86 implementations use. My point is just that IBM is not blind to the realities of implementing such a memory ordering model.
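Just to make the reordering concrete, here is the example I quoted above written out as a small C11 litmus test. This is purely my own sketch (the thread functions, variable names and pthread scaffolding are mine, not from anyone's post): with everything left relaxed, r0 == 0 && r1 == 0 is a permitted outcome, and the store-to-load reordering that produces it is allowed even under x86/zSeries-style ordering unless you order the store before the load with a full fence (or seq_cst).

    /* Hypothetical sketch of the store-buffering example above, C11 atomics. */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static atomic_int x, y;
    static int r0, r1;

    static void *cpu1_thread(void *arg)
    {
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);  /* 1 -> [x]  */
        r0 = atomic_load_explicit(&y, memory_order_relaxed); /* [y] -> r0 */
        return NULL;
    }

    static void *cpu2_thread(void *arg)
    {
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);  /* 1 -> [y]  */
        r1 = atomic_load_explicit(&x, memory_order_relaxed); /* [x] -> r1 */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, cpu1_thread, NULL);
        pthread_create(&t2, NULL, cpu2_thread, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* A single run won't reliably show it, but r0 == 0 && r1 == 0 is a
           legal outcome here unless each thread orders its store before its
           load, e.g. atomic_thread_fence(memory_order_seq_cst) between them
           (an mfence on x86, a full sync on POWER). */
        printf("r0=%d r1=%d\n", r0, r1);
        return 0;
    }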
>
> > In the case of ARM, I would say there is zero chance they did not re-examine the memory ordering model when
> > defining the 64-bit ISA, with *data* (at least from simulations)
> > rather than old wives' tales, and they would have
> > taken input from Apple and their own designers (AFAIK Cortex cores do not do a lot of reordering anyway).
>
> Based on my conversations with designers, the ARM ordering model is advantageous for
> simpler cores...but for anything A15+ it's basically not an advantage over x86.
Do smaller cores actually do significantly more reordering, such as store/store reordering?
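For concreteness, the store/store reordering I'm asking about is the usual message-passing pattern. A minimal C11 sketch (names and values are mine, purely illustrative, not from this discussion):

    #include <stdatomic.h>

    static atomic_int data, flag;

    /* Writer publishes data, then sets flag. */
    void writer(void)
    {
        atomic_store_explicit(&data, 42, memory_order_relaxed);
        /* memory_order_release here would forbid the data store from
           being reordered after the flag store. */
        atomic_store_explicit(&flag, 1, memory_order_relaxed);
    }

    /* Reader checks flag, then reads data. */
    int reader(void)
    {
        if (atomic_load_explicit(&flag, memory_order_relaxed)) {
            /* With memory_order_acquire on the flag load, seeing flag == 1
               would guarantee seeing data == 42.  Fully relaxed, a weakly
               ordered core can legitimately return 0 here. */
            return atomic_load_explicit(&data, memory_order_relaxed);
        }
        return -1;
    }

On a TSO-style machine the hardware already keeps the two stores (and the two loads) in program order; on a weakly ordered core the writer needs release (or a write barrier) and the reader acquire (or a read barrier) to forbid the flag == 1, data == 0 outcome.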
>
> > And there is zero chance that any of the designers involved in any modern OOOE processor are unaware of
> > any of the speculative techniques that you or I know of, and they probably know a few that we don't, too.
>
> Ah yes, but some of ARM's architectural choices are undoubtedly swayed by
> the (lack of) circuit design capabilities amongst their main customers.
Simpler design is an advantage in a very real sense, although I'm not sure where that point gets us exactly. Do any 3rd-party designs actually implement weaker ordering? It would be instructive to know. Is weaker ordering relatively more important for implementations with less aggressive circuit design? It's not clear.