Sequential consistency in hardware

By: never_released (never_released.delete@this.gmx.tw), August 4, 2020 2:03 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on August 4, 2020 12:56 pm wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on August 3, 2020 11:20 pm wrote:
> >
> > In the case of nVidia, who knows? I don't have any inside knowledge, but if they were working
> > on something like transactional memory due to their binary translation efforts, they may have
> > ended up with memory units where sequential consistency simply falls out of that work.
> >
> > At that point it's not an "expense" any more. At that point it's suddenly possibly a performance advantage.
>
> Side note, because this seems to confuse a lot of people: "memory
> ordering" doesn't necessarily actually imply the "ordering" part.
>
> I think that's something that seems to constantly trip up some of the proponents of weakly ordered memory
> units, because they get hung up on the "ordering" part, and get hung up on that as the implementation.
>
> But the best way to guarantee a certain "ordering" isn't actually to order anything at all. Not
> at the DRAM interface, not at the cache coherency layer, and not even in the memory units.
>
> No, the best way to guarantee an "ordering" is to have a model where nobody can tell any different.
> This is why transactional memory comes into the picture, because if you start talking in terms of "transactions",
> the people who got hung up on the "ordering" part get unstuck, and get the point.
>
> Now, I happen to think that transactional memory as an ISA feature tends to be a worthless
> exercise. I can only refer you to the discussions about why load-locked/store-conditional
> isn't actually a great interface. Transactional memory as an interface has all the same
> problems that load-locked/store-conditional has, except it dials them up to 11.
>
> So I'm not a huge believer in transactional memory as an exposed ISA, but both as a conceptual
> model for memory ordering, and as an internal implementation it's a great model.
>
> That internal implementation might be as part of a software translation layer (so it might be part
> of the internal ISA that you don't expose to users, because it's too tightly tied to implementation),
> but also as the model that the hardware itself uses internally in its microarchitecture.
>
> So in some respects I think it may be misleading to talk about "memory ordering" in the classic
> sense - where you think of it as an "ordering on the memory bus" kind of thing. That ordering
> is completely irrelevant to the ISA except in the simplest and most naive implementations (and
> it's obviously where many of the original orderings came from: the easiest way to explain the
> x86 ordering rules is probably to just think of them as historical implementations).
>
> The CPU already doesn't actually do its internal accesses in the terms that the individual
> instructions work. When you do a "load byte into register", at no point does the hardware
> really work at a byte level. It's all about the cacheline movements, and the "byte" part is
> almost entirely irrelevant and just a final small detail in how the data gets presented.
>
> And similarly, when you have several consecutive memory operations as instructions, you don't really
> need to "order" them as per your memory ordering. In the simplest case, if you have two instructions
> that load (or store) to the same cacheline (or same store buffer entry), you can just combine them
> into one actual memory access. There's no "ordering" left when you do that, and you have basically
> turned those two instructions into the simplest kind of transaction you can imagine.
>
> But that "transactional" view really doesn't have to stop at that trivial case. You can - and I will argue
> strongly that you should - expand it to a bigger model where you can handle multiple outstanding accesses
> using a "memory transaction unit" that makes sure that they are all treated as proper transactions, and just
> aborts and retries whenever a later (in the instruction stream) instruction fails the transaction test.
>
> And the nice part of this is that, because you never exposed the transactions at an architectural level (and
> I really do think it's a mistake to do that), that "transaction manager" really has a relatively easy time
> of it. There are no user-visible transaction boundaries, so you can just always abort at any time you were going
> to retire an instruction that the transaction manager noticed had violated the transaction rules.
>
> So no livelocks, no huge conceptual problems like that. No "I can't complete this transaction, now
> you need to have a different fallback". Just a simple "uhhuh, I can't retire you and need to force
> a restart, because that would violate the transaction rules with a previously retired instruction".
>
> (And by "simple", I mean "you will have a lot of complexity to make sure you don't speculate so wildly that
> you cause too many abort-and-restart cases", so this presumably involves a complex speculation predictor,
> but it's conceptually simple and avoids all the problems with real user-visible transactions).
>
> And if you do that, sequential consistency (or any other kind of consistency) doesn't come
> from the ordering of your memory operations, it comes out purely as a result of the rules
> of your memory transaction unit. The "physical" order on the bus and the cache protocol
> is almost entirely separated from the "virtual" order of the instruction stream.
>
> Maybe you start out with a "transaction unit" that is just a single cacheline (or even just a partial
> one depending on internal bus widths etc), which really only allows those simple merging cases.
>
> But it doesn't have to be that simple. As mentioned, nVidia may have had those code translation reasons
> to have a transaction manager anyway, and then the sequential consistency may have just fallen out of that.
>
> Linus

Hello,

Those Nvidia processors indeed have a fully out-of-order memory subsystem, documented as:
> The Carmel memory system is essentially a fully out-of-order memory system, which does not preserve order between independent loads. For cacheable loads, this is not problematic because the coupling between the commitment logic and coherence snoops resolves potential ordering issues.
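
To make "coupling between the commitment logic and coherence snoops" a bit more concrete, here is a toy Python model. It is only a sketch under my own assumptions, not a description of how Carmel is actually built: `ToyCore`, `Load`, `snoop_squash` and `run` are hypothetical names, and memory is just a dict. Loads may execute in any order, but they commit in program order, and a snoop for an address squashes any uncommitted load that already read it (plus everything younger) so it gets replayed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Load:
    addr: str
    value: Optional[int] = None  # value seen by the speculative execution
    executed: bool = False


class ToyCore:
    """Hypothetical core: loads execute in any order but commit in program order."""

    def __init__(self, memory, program, snoop_squash=True):
        self.memory = memory
        self.loads = [Load(addr) for addr in program]  # kept in program order
        self.snoop_squash = snoop_squash
        self.committed = []

    def execute(self, index):
        """Speculatively perform one load; may be called out of program order."""
        load = self.loads[index]
        load.value = self.memory[load.addr]
        load.executed = True

    def snoop(self, addr):
        """Another agent wrote addr: squash the oldest uncommitted load that
        already read it, plus everything younger, so they replay later."""
        if not self.snoop_squash:
            return
        pending = self.loads[len(self.committed):]
        for pos, load in enumerate(pending):
            if load.addr == addr and load.executed:
                for younger in pending[pos:]:
                    younger.executed = False
                break

    def commit_all(self):
        """Commit in program order, re-executing any squashed load first."""
        for index in range(len(self.committed), len(self.loads)):
            if not self.loads[index].executed:
                self.execute(index)
            self.committed.append(self.loads[index].value)
        return tuple(self.committed)


def run(snoop_squash):
    memory = {"data": 0, "flag": 0}
    core = ToyCore(memory, ["flag", "data"], snoop_squash)  # reader: flag, then data
    core.execute(1)                # younger load of data runs early and sees 0
    for addr in ("data", "flag"):  # writer: data = 1, then flag = 1
        memory[addr] = 1
        core.snoop(addr)
    core.execute(0)                # older load of flag runs late and sees 1
    return core.commit_all()       # (flag, data) as architecturally observed


print(run(snoop_squash=False))  # (1, 0): flag set but stale data, reordering is visible
print(run(snoop_squash=True))   # (1, 1): the early data load was squashed and replayed
```

The point of the sketch is that nothing is "ordered" at execution time. The out-of-order load is simply not allowed to commit once a snoop shows that somebody could tell the difference.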

Tegra X2, the predecessor of Tegra Xavier, didn't have sequential consistency on its Denver2 cores despite their dynamic binary translation design.
However, the Denver2 cores in Tegra X2 have a coherent instruction cache, whereas the Carmel cores in Tegra Xavier have a non-coherent one.
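
For anyone wondering what having or not having sequential consistency means in practice, the usual illustration is the store-buffering litmus test. The Python sketch below is not something you would run on the hardware: it just enumerates every interleaving of two tiny threads, which is what sequential consistency permits by definition. The helper `sc_outcomes` and the tuple encoding of operations are my own hypothetical choices.

```python
from itertools import combinations


def sc_outcomes(thread0, thread1):
    """Return every (r0, r1) result reachable by interleaving the two threads
    while keeping each thread's own program order (sequential consistency)."""
    n0, n1 = len(thread0), len(thread1)
    outcomes = set()
    # Choose which global time slots belong to thread 0; the rest go to thread 1.
    for slots in combinations(range(n0 + n1), n0):
        slots = set(slots)
        memory, regs = {}, {}
        ops0, ops1 = iter(thread0), iter(thread1)
        for step in range(n0 + n1):
            kind, addr, arg = next(ops0) if step in slots else next(ops1)
            if kind == "store":
                memory[addr] = arg
            else:  # "load": arg names the destination register
                regs[arg] = memory.get(addr, 0)  # locations start at 0
        outcomes.add((regs["r0"], regs["r1"]))
    return outcomes


# Store-buffering test: each thread stores to one location, then loads the other.
T0 = [("store", "x", 1), ("load", "y", "r0")]
T1 = [("store", "y", 1), ("load", "x", "r1")]

print(sorted(sc_outcomes(T0, T1)))  # [(0, 1), (1, 0), (1, 1)]
```

The result r0 = r1 = 0 never appears in that set. Weaker models such as x86-TSO and the Arm memory model do allow it for plain loads and stores, so observing it on real hardware is enough to show that a core is not sequentially consistent.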