CMPXCHG - all or nothing

By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), March 20, 2008 8:45 am
Room: Moderated Discussions
Michael S (already5chosen@yahoo.com) on 3/20/08 wrote:
>
>I did read Intel memory ordering whitepaper before posting
>the original post. It clearly states that locked operations
>from different agents are ordered in absolute machine-wide
>order (i.e. seen in the same order by all observers) rather
>than in "processor order" like the rest of the WB stores.

Ahh, I think that is just a natural outcome of all the
other rules - ie loads are not re-ordered wrt each other,
and stores are also guaranteed to be seen in the same order.

Couple that with the fact that locked instructions also
disable some of the reordering that is allowed (ie
the local store-to-load forwarding and loads passing
stores), and you end up with total ordering for the locked
ops.

>I'm too stupid to figure out whether semantics like these
>constrain possible implementations of big hierarchical
>directory based NUMA machines.

Hey, being too stupid to understand all the implications of
memory ordering means that pretty exactly 100% of humanity
is too stupid. Those things are really really subtle, but I
do think that the 2.7 rule is not actually fundamental, but
springs directly from the previous rules together with
the fact that atomic ops hold on to the cacheline.

And no, none of it needs non-local information. Basically,
here's the rule:

- in order for a CPU to do a locked op, it has to have
that cacheline exclusively and continuously for the read
and write (that "continuously" part is the key difference: a
non-locked access can split the read and write, so the read
may be satisfied early and the cacheline then lost and
regained before the write).

- in order for another CPU to see the result, it obviously
will have to have that cacheline too.

- now, if you track the cacheline ownership for that 2.7
case, the disallowed case simply cannot happen.

Why? Because processors 2 and 3 can only reorder their
reads to _x and _y if they keep them in their cache across
the execution of both instructions (otherwise, the load
reordering would be architecturally visible, which is
against 2.1).

Think of _x and _y as being cachelines, and the locked
ops as being "guaranteed sequence point for that cacheline",
simply because it's owned exclusively by that CPU for
that one whole instruction.

So in order for the illegal case in 2.7 to happen (r3=1,
r4=0, r5=1, r6=0), we know that the movement for cacheline
_x must have been: P3x -> P0x -> P2x. And similarly for _y
it must have been P2y -> P1y -> P3y. Agreed?

That's all possible, because we're talking about two
different cachelines, and they can obviously move around
their ownership independently of each other.

But now, look at P2: because of the load order requirement,
we know that P2 must have held _x before it held _y,
because otherwise the load ordering (rule 2.1) is no longer
valid. So we know that P2x -> P2y. Similarly, for the same
reasons, we know that P3y -> P3x.

And this is where x86 is different from your regular weakly
ordered read-rule CPU. If you allow reads to be visibly
reordered wrt other reads, those P2x -> P2y and P3y -> P3x
guarantees do not exist, and the 2.7 guarantee doesn't hold.

But on x86 you do know that either P2 held both cachelines
over both instructions or it did the P2y fill later than P2x
(and the same for P3), because anything else would violate
the local load ordering constraints of 2.1 and the CPU could
be architecturally seen to re-order loads.

(Note the importance of "architecturally". P2 can
do load re-ordering internally, but only if it holds on to
the cachelines in question across the whole sequence, which
makes the reordering architecturally invisible since
clearly no other CPU could have modified those values
and shown an ordering violation!)

So now you have an impossible sequence. There's no way
we can satisfy all of

P3x -> P0x -> P2x (cacheline _x movement order)
P2x -> P2y (CPU2 cacheline load order)
P2y -> P1y -> P3y (cacheline _y movement order)
P3y -> P3x (CPU3 cacheline load order)

because that would be a circular dependency (P3x -> .. ->
P3x).

And notice how all these things are "local" decisions. All
of those ordering points are decisions about a single
cacheline state change at a single CPU, there's no "global
ordering" required for the above argument, every point is
just a local decision.

But yeah, memory ordering is really easy to get wrong, so
maybe I have a thinko there somewhere. But basically I
claim that it falls out of just the cache coherency rules
and the new stricter load ordering that Intel guarantees.

Linus