Sequential consistency in hardware

By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), August 4, 2020 11:56 am
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on August 3, 2020 11:20 pm wrote:
>
> In the case of nVidia, who knows? I don't have any inside knowledge, but if they were working
> on something like transactional memory due to their binary translation efforts, they may have
> ended up with memory units where sequential consistency simply falls out of that work.
>
> At that point it's not an "expense" any more. At that point it's suddenly possibly a performance advantage.

Side note, because this seems to confuse a lot of people: "memory ordering" doesn't necessarily actually imply the "ordering" part.

I think that's something that seems to constantly trip up some of the proponents of weakly ordered memory units, because they get hung up on the "ordering" part and treat it as the implementation.

But the best way to guarantee a certain "ordering" isn't actually to order anything at all. Not at the DRAM interface, not at the cache coherency layer, and not even in the memory units.

No, the best way to guarantee an "ordering" is to have a model where nobody can tell any different. This is why transactional memory comes into the picture, because if you start talking in terms of "transactions", the people who got hung up on the "ordering" part get unstuck, and get the point.
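To make "nobody can tell any different" concrete, here is a minimal sketch (not from the original post) that enumerates every outcome sequential consistency permits for the classic store-buffering litmus test. The thread programs `T0` and `T1`, the helper names, and the tuple encoding of operations are all assumptions made for illustration. The idea is that hardware can do whatever it likes internally, as long as the observable results stay inside this set.

```python
from itertools import combinations

def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's own order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        pos = set(positions)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in pos else next(ib) for i in range(n)]

def run(schedule):
    """Execute one interleaving against a single shared memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, target, arg in schedule:
        if op == "store":
            mem[target] = arg          # target is a location, arg is the value
        else:
            regs[target] = mem[arg]    # target is a register, arg is a location
    return regs["r0"], regs["r1"]

# Hypothetical litmus test: each thread stores to one location, then loads the other.
T0 = [("store", "x", 1), ("load", "r0", "y")]
T1 = [("store", "y", 1), ("load", "r1", "x")]

# The set of (r0, r1) results that some sequentially consistent execution can produce.
sc_outcomes = {run(s) for s in interleavings(T0, T1)}
print(sorted(sc_outcomes))
```

Both loads returning 0 is not in the set, so an implementation that never lets software observe that result is sequentially consistent for this test, regardless of what order it touched memory in.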

Now, I happen to think that transactional memory as an ISA feature tends to be a worthless exercise. I can only refer you to the discussions about why load-locked/store-conditional isn't actually a great interface. Transactional memory as an interface has all the same problems that load-locked/store-conditional has, except it dials them up to 11.

So I'm not a huge believer in transactional memory as an exposed ISA, but both as a conceptual model for memory ordering, and as an internal implementation it's a great model.

That internal implementation might be as part of a software translation layer (so it might be part of the internal ISA that you don't expose to users, because it's too tightly tied to implementation), but also as the model that the hardware itself uses internally in its microarchitecture.

So in some respects I think it may be misleading to talk about "memory ordering" in the classic sense - where you think of it as an "ordering on the memory bus" kind of thing. That ordering is completely irrelevant to the ISA except in the simplest and most naive implementations (and it's obviously where many of the original orderings came from: the easiest way to explain the x86 ordering rules is probably to just think of them as historical implementations).

The CPU already doesn't actually do its internal accesses in the terms that the individual instructions work. When you do a "load byte into register", at no point does the hardware really work at a byte level. It's all about the cacheline movements, and the "byte" part is almost entirely irrelevant and just a final small detail in how the data gets presented.

And similarly, when you have several consecutive memory operations as instructions, you don't really need to "order" them as per your memory ordering. In the simplest case, if you have two instructions that load (or store) to the same cacheline (or same store buffer entry), you can just combine them into one actual memory access. There's no "ordering" left when you do that, and you have basically turned those two instructions into the simplest kind of transaction you can imagine.
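As a rough illustration of that simplest case, the sketch below folds consecutive byte accesses that land in the same cacheline into a single access. The 64-byte line size, the `coalesce` name, and the example addresses are assumptions, not anything a particular CPU is known to do in exactly this form.

```python
CACHELINE_BYTES = 64  # assumed line size; real hardware varies

def coalesce(accesses, line_bytes=CACHELINE_BYTES):
    """Merge consecutive accesses that hit the same cacheline.

    `accesses` is a list of byte addresses in program order. The result is a
    list of (line_address, [offsets]) pairs, one pair per actual memory access.
    """
    merged = []
    for addr in accesses:
        offset = addr % line_bytes
        line = addr - offset
        if merged and merged[-1][0] == line:
            # Same line as the previous access: fold it in, no new access.
            merged[-1][1].append(offset)
        else:
            merged.append((line, [offset]))
    return merged

# Four byte-level loads in program order become two line-level accesses.
loads = [0x1000, 0x1008, 0x1040, 0x1044]
result = coalesce(loads)
print(result)
```

Within each merged group there is nothing left to order: the offsets are all read out of the same line at the same moment.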

But that "transactional" view really doesn't have to stop at that trivial case. You can - and I will argue strongly that you should - expand it to a bigger model where you can handle multiple outstanding accesses using a "memory transaction unit" that makes sure that they are all treated as proper transactions, and just aborts and retries whenever a later (in the instruction stream) instruction fails the transaction test.

And the nice part of this is that because you never exposed the transactions at an architectural level (and I really do think it's a mistake to do that), that "transaction manager" really has a relatively easy time of it. There are no user-visible transaction boundaries, so you can just always abort at any time you were going to retire an instruction that the transaction manager noticed had violated the transaction rules.

So no livelocks, no huge conceptual problems like that. No "I can't complete this transaction, now you need to have a different fallback". Just a simple "uhhuh, I can't retire you and need to force a restart, because that would violate the transaction rules with a previously retired instruction".

(And by "simple", I mean "you will have a lot of complexity to make sure you don't speculate so wildly that you cause too many abort-and-restart cases", so this presumably involves a complex speculation predictor, but it's conceptually simple and avoids all the problems with real user-visible transactions).
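A toy model of that abort-and-restart rule might look like the following. Everything here is an assumption made for illustration: the `SharedMemory` class, the per-line version counter standing in for a snooped invalidation, and the `interference` hook that injects another core's store just before a given load retires. Loads execute speculatively out of order, but a load may only retire if the line it read has not changed since.

```python
class SharedMemory:
    """Toy coherent memory: each line has a value and a version counter
    that is bumped on every write (a stand-in for an invalidation)."""
    def __init__(self):
        self.value = {}
        self.version = {}

    def read(self, line):
        return self.value.get(line, 0), self.version.get(line, 0)

    def write(self, line, value):
        self.value[line] = value
        self.version[line] = self.version.get(line, 0) + 1


def execute(memory, loads, interference=None):
    """Run `loads` (line addresses in program order) and return
    (retired_values, restarts).

    `interference` maps a retire index to a (line, value) store from another
    core that lands just before that load retires (hypothetical test hook).
    """
    pending = dict(interference or {})
    retired, restarts, pc = [], 0, 0
    while pc < len(loads):
        # Speculate: execute all unretired loads, youngest first.
        spec = {i: memory.read(loads[i]) for i in reversed(range(pc, len(loads)))}
        # Retire in program order, checking the transaction rule each time.
        while pc < len(loads):
            if pc in pending:
                memory.write(*pending.pop(pc))  # the other core's store arrives
            value, seen_version = spec[pc]
            if memory.read(loads[pc])[1] != seen_version:
                restarts += 1  # cannot retire: squash this load and younger ones
                break
            retired.append(value)
            pc += 1
    return retired, restarts


mem = SharedMemory()
# Load line 0 then line 64; another core stores 7 to line 64 before load 1 retires.
values, restarts = execute(mem, [0, 64], interference={1: (64, 7)})
print(values, restarts)
```

The second load was executed early and saw the stale value, but it never retires with it. The restart makes every retired load look as if it ran in program order, even though no access was actually ordered.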

And if you do that, sequential consistency (or any other kind of consistency) doesn't come from the ordering of your memory operations, it comes out purely as a result of the rules of your memory transaction unit. The "physical" order on the bus and the cache protocol is almost entirely separated from the "virtual" order of the instruction stream.

Maybe you start out with a "transaction unit" that is just a single cacheline (or even just a partial one depending on internal bus widths etc), which really only allows those simple merging cases.

But it doesn't have to be that simple. As mentioned, nVidia may have had those code translation reasons to have a transaction manager anyway, and then the sequential consistency may have just fallen out of that.

Linus