CMPXCHG - all or nothing

By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), March 21, 2008 8:47 am
Room: Moderated Discussions
Michael S (already5chosen@yahoo.com) on 3/21/08 wrote:
>>
>>Why? Because processors 2 and 3 can only reorder their
>>reads to _x and _y if they keep them in their cache across
>>the execution of both instructions (otherwise, the load
>>reordering would be architecturally visible, which is
>>against 2.1).
>
>I disagree. The rule 2.1 insists on "program order". There
>is no particular program order between (xchg [ _x], ...)
>on Processor 0 and (xchg [ _y],...) on Processor 1.

Please read what I wrote.

I expressly talked about Processors 2 and 3. The ones that
do two loads.

Those loads are constrained to either have the two
cachelines in the cache over both loads, or the second
load has to be loaded into the cache later. Otherwise you
would violate 2.1 on those CPU's.

It has absolutely nothing to do with the (single)
xchg on P0 and P1.
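
For reference, here is a minimal sketch of the scenario
being discussed. The register names r3..r6 and the
forbidden outcome are the ones quoted further down; the
pthread wrapper and the exact assignment of loads to
registers are my reconstruction for illustration, not text
from the whitepaper or from this thread.

/* Four-processor example under discussion: P0/P1 each do a locked
 * store (xchg) to _x/_y, P2/P3 each do two plain loads.
 * The outcome discussed below (r3=1, r4=0, r5=1, r6=0) is the one
 * the memory ordering rules forbid.
 * Build: gcc -O2 -pthread iriw.c   (x86/x86-64 only) */
#include <pthread.h>
#include <stdio.h>

volatile int _x = 0, _y = 0;
volatile int r3, r4, r5, r6;

static void *p0(void *arg) {            /* Processor 0: xchg [_x], 1 */
    int v = 1;
    __asm__ __volatile__("xchgl %0, %1" : "+r"(v), "+m"(_x) : : "memory");
    return NULL;
}
static void *p1(void *arg) {            /* Processor 1: xchg [_y], 1 */
    int v = 1;
    __asm__ __volatile__("xchgl %0, %1" : "+r"(v), "+m"(_y) : : "memory");
    return NULL;
}
static void *p2(void *arg) {            /* Processor 2: load _x, then _y */
    r3 = _x;
    r4 = _y;
    return NULL;
}
static void *p3(void *arg) {            /* Processor 3: load _y, then _x */
    r5 = _y;
    r6 = _x;
    return NULL;
}

int main(void) {
    pthread_t t[4];
    void *(*fn[4])(void *) = { p0, p1, p2, p3 };
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, fn[i], NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    /* The forbidden result: P2 saw the store to _x but not the one to _y,
     * while P3 saw the store to _y but not the one to _x. */
    printf("r3=%d r4=%d r5=%d r6=%d\n", r3, r4, r5, r6);
    return 0;
}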

So the load reordering rules result in those two rules of

P2x -> P2y
P3x -> P3y

being a required ordering on those two CPU's. The argument
being that if you don't follow that simple rule, then you
are never going to be able to guarantee the load ordering
requirement of 2.1 without some magic external oracle (aka
complex global knowledge that is not appropriate for
the hottest path in the core!)

IOW, this is a practical rule, not a theoretical one. I'm
sure that in theory you could have some really
complex state machine with global knowledge, but I'm also
sure that in practice there is no way you can do that in
the memory pipeline any way close to reasonably.

>>So in order for the illegal case in 2.7 to happen (r3=1,
>>r4=0, r5=1, r6=0), we know that the movement for cacheline
>>_x must have been: P3x -> P0x -> P2x. And similarly for _y
>>it must have been P2y -> P1y ->P3y. Agreed?
>
>No.
>First, you assume a simple MESI or MESI-like implementation.

No, I do not.

I assume only cache coherency.

In order to do an atomic locked op, the CPU needs to hold
that cacheline exclusively in its cache. That is what I
assume, nothing more. It doesn't matter what the actual
implementation of the cache coherency protocol is.
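
As a concrete point of reference (my own sketch, not
something from the thread), this is what a locked
compare-and-exchange typically looks like when written with
GCC inline asm; the lock prefix is what forces the CPU to
own the target cacheline exclusively for the duration of
the operation.

/* Sketch of a locked compare-and-exchange on x86 via GCC inline asm.
 * cmpxchg compares EAX against *ptr; if equal it stores new_val, and
 * in either case the old memory value ends up in EAX. */
#include <stdio.h>

static inline int cmpxchg_int(volatile int *ptr, int old, int new_val)
{
    int prev;
    __asm__ __volatile__("lock; cmpxchgl %2, %1"
                         : "=a"(prev), "+m"(*ptr)
                         : "r"(new_val), "0"(old)
                         : "memory", "cc");
    return prev;              /* equals 'old' iff the exchange happened */
}

int main(void)
{
    volatile int lock_word = 0;
    /* take a simple test-and-set style lock: 0 -> 1 */
    if (cmpxchg_int(&lock_word, 0, 1) == 0)
        printf("acquired, lock_word=%d\n", lock_word);
    return 0;
}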

>Second, even for MESI, in the initial state line _x could
>be shared between P0 and P3 or P0, P2 and P3. Same for
>line _y.

But that is irrelevant to my argument. Because of the locked
op, it will have to become unshared, and move to P0/P1 for
the duration of the locked op, and when it does so, it will
be exclusively at P0/P1.

In other words, the state before the xchg is irrelevant.

The only thing I'm saying is that the cachelines must
have moved like

P3x -> P0x -> P2x
P2y -> P1y -> P3y

where the middle case is exclusive. Yes, at the ends the
cacheline can also be shared with any number of
other CPU's, but that is not relevant to the argument. The
only part that is relevant is that at the points P0x and
P1y the cacheline cannot have been at any other CPU.

In other words, we knew the cacheline was at P3x (maybe
it was at some other place too but that doesn't
matter), and that it had to become exclusive at P0x, and
then went to P2x (and perhaps elsewhere).

We explicitly do not care about any "nonlocal"
information. The above looks only at two single cachelines
in individual CPU's, and looks at them as independent
entities.

So for example, P3x is literally "cacheline x in CPU3", it
does not imply that "cacheline x" might not exist
at any other CPU. But because of the values we see, we do
know that P3x (value=0) must have preceded P0x (the xchg)
which must have preceded P2x (value=1) by only
depending on

- cache coherency
- causality

nothing else.

>>But now, look at P2: because of the load order requirement,
>>we know that P2 must have held _x before it held _y,
>>because otherwise the load ordering (rule 2.1) is no longer
>>valid. So we know that P2x -> P2y. Similarly, for the same
>>reasons, we know that P3y -> P3x.
>
>No. 2.1. load ordering applies only to locations modified
>by the same processor.

Again, irrelevant for my argument. You claimed that total
ordering was expensive. I showed that if you follow simple
rules, the total ordering is a local thing and not at all
expensive.

And the simple rule (for loads) is:
- you can reorder loads arbitrarily as long as the
values loaded stay in the cache (no evictions) over
the whole sequence
- but if you don't have the values in cache, the cache
loads must be in program order (see the sketch below).
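
Purely as illustration, here is a toy model of that rule.
It is my sketch of the stated behavior only, not a
description of how any actual core implements it.

/* Toy model of the local rule above: a younger load may complete early,
 * but if coherency takes its cacheline away while an older load is
 * still pending, the younger load must be replayed so the reordering
 * never becomes architecturally visible. */
#include <stdbool.h>
#include <stdio.h>

struct load {
    int  line;          /* cacheline this load reads            */
    bool completed;     /* value already obtained (maybe early) */
    bool needs_replay;  /* must be re-executed in program order */
};

/* Coherency event: this CPU loses 'line' (eviction or invalidation). */
static void on_line_lost(struct load *lq, int n, int line)
{
    for (int i = 0; i < n; i++) {
        if (lq[i].line != line || !lq[i].completed)
            continue;
        for (int j = 0; j < i; j++)     /* any older load still pending? */
            if (!lq[j].completed)
                lq[i].needs_replay = true;
    }
}

int main(void)
{
    /* Program order: load A from line 1, then load B from line 2.
     * B completed early; line 2 is then snooped away while A is still
     * pending, so B has to be replayed. */
    struct load lq[2] = {
        { .line = 1, .completed = false },
        { .line = 2, .completed = true  },
    };
    on_line_lost(lq, 2, 2);
    printf("replay B? %s\n", lq[1].needs_replay ? "yes" : "no");
    return 0;
}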

Again, you seem to be arguing from some "unknown global
ordering" standpoint. I'm not. I'm arguing entirely from
the local ordering standpoint that gives you the semantics
that Intel guarantees.

So I'm arguing that in order for a CPU to honor the Intel
memory ordering rules, it can act the way I explained,
without any real "global state" (the cache coherency can
arguably be considered "global state", but I take that
one for granted - it's simply immaterial what the actual
cache coherency implementation is, as long as it gives
you coherency).

IOW, all the rules are local, and there is no extra cost
as the number of CPU's goes up (again, apart from the cost
of cache coherency itself!).

Could you perhaps make up non-local rules that a CPU
designer could also use to implement the Intel
memory ordering rules? Oh, I'm sure you could. But since
that's the last thing you actually want to have, what's
the point?

I was describing the algorithm that I think Intel uses,
and that gives the expected end results.

That said, I expressly avoided talking about the
problem cases. The real problem case is the write buffer
before a write actually hits the cacheline, but that was
not germane to this discussion, since the whole point was
about the nature of locked instructions which are in turn
guaranteed to act as if the buffer didn't exist (and
whether that is done by actually draining the buffer, or
by having strict rules about the elements in the buffer
having to already be backed by exclusive cache locations
or similar is then an implementation detail).

The write buffer itself doesn't participate in the cache
coherency, which is why loads can pass local writes etc.
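
To make "loads can pass local writes" concrete, here is the
classic store-buffer case (my own illustration, in the
spirit of the whitepaper's examples): each CPU stores to
one variable and then loads the other, and because the
store can still be sitting in the local write buffer when
the load executes, both CPUs may read 0.

/* Store-buffer illustration: each thread stores 1 to its own variable
 * and then loads the other. On x86 the store may still be in the local
 * write buffer when the load executes, so r1 == 0 && r2 == 0 is an
 * allowed outcome -- the loads passed the local writes.
 * Build: gcc -O2 -pthread sb.c   (x86/x86-64) */
#include <pthread.h>
#include <stdio.h>

volatile int _x = 0, _y = 0;
volatile int r1, r2;

static void *cpu0(void *arg) {
    _x = 1;      /* plain store: goes through the write buffer */
    r1 = _y;     /* load may execute before the store is globally visible */
    return NULL;
}
static void *cpu1(void *arg) {
    _y = 1;
    r2 = _x;
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, cpu0, NULL);
    pthread_create(&b, NULL, cpu1, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("r1=%d r2=%d%s\n", r1, r2,
           (r1 == 0 && r2 == 0) ? "  (both loads passed the local writes)" : "");
    return 0;
}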

And no, I will not guarantee that I got everything right.
But I think that you disagree with my analysis on the wrong
grounds: you seem to build your argument based on the
memory ordering whitepaper itself (and no assumptions about
how it's implemented), while I build my argument from the
standpoint of what I think the Intel implementation rules
are - and am trying to show that you can implement the
memory ordering without any nonlocal decisions.

IOW, I'm just arguing that the cost of a locked instruction
doesn't go up with the number of CPU's, and it's actually
pretty cheap. It's pretty cheap even on current CPU's (which
all seem to basically drain the write queues and the
instruction pipeline - a few tens of cycles), but I think
it can be cheaper still if you use smarter models than a
full drain.

The expense in the Intel memory ordering model is:

- it likely needs bigger caches (and more ways) to work
efficiently.

Reason: the core can only do a good job of reordering
operations when they are in the cache, so if you want
the same amount of core freedom as with a more weakly
ordered model, you'd better make sure you don't have
unnecessary cache evictions.

So weakly ordered probably makes more sense if you have
a small one-way associative L1 cache. Of course, those
are horribly broken for other reasons, so..

- the write buffer is clearly much more complex, since it
requires that whole ordering thing (and again, reordering
is possible, but only if you can guarantee to keep
reordered accesses exclusively in the cache over the
whole reordered sequence - that way any other CPU is
guaranteed to not be able to see the reordered state!)

but a locked instruction - given the above caveats - is not
all that expensive on its own.

The write buffers are where I think all the real issues
are. They aren't cache-coherent, but they are on a really
critical path, and need to be coherent within the single
CPU. They also have these ordering guarantees (which can
be broken, but only if you maintain a few rules so that the
breakage isn't architecturally visible), along with the
whole synchronization thing too.

Linus