CMPXCHG - all or nothing

By: Michael S (already5chosen.delete@this.yahoo.com), March 21, 2008 7:08 am
Room: Moderated Discussions
To help other people following the discussion here is a link to the Intel® 64 (a.k.a. AMD64) Architecture Memory Ordering White Paper:
http://www.intel.com/products/processor/manuals/318147.pdf

Linus Torvalds (torvalds@linux-foundation.org) on 3/20/08 wrote:
---------------------------
>Michael S (already5chosen@yahoo.com) on 3/20/08 wrote:
>>
>>I did read Intel memory ordering whitepaper before posting
>>the original post. It clearly states that locked operations
>>from different agents are ordered in absolute machine-wide
>>order (i.e. seen in the same order by all observers) rather
>>than in "processor order" like the rest of the WB stores.
>
>Ahh, I think that is just a natural outcome of all the
>other rules - ie loads are not re-ordered wrt each other,
>and stores also are guaranteed to be seen in the same order.
>
>Couple that with the fact that locked instructions also
>disable some of the reordering that is allowed (ie
>the local store-to-load forwarding and loads passing
>stores), and you end up with total ordering for the locked
>ops.
>
>>I'm too stupid to figure out whether semantics like these
>>constrain possible implementations of big hierarchical
>>directory based NUMA machines.
>
>Hey, being too stupid to understand all the implications of
>memory ordering means that pretty exactly 100% of humanity
>is too stupid. Those things are really really subtle, but I
>do think that the 2.7 rule is not actually fundamental, but
>springs directly from the previous rules together with
>the fact that atomic ops hold on to the cacheline.
>
>And no, none of it needs non-local information. Basically,
>here's the rule:
>
>- in order for a CPU to do a locked op, it has to have
>that cacheline exclusively and continuously for the read
>and write (that "continuously" part is different: a
>non-locked access can split the read and write and the
>read could be satisfied earlier and then the cacheline
>lost and regained).
>
>- in order for another CPU to see the result, it obviously
>will have to have that cacheline too.
>
>- now, if you track the cacheline ownership for that 2.7
>case, the disallowed case simply cannot happen.
>
>Why? Because processors 2 and 3 can only reorder their
>reads to _x and _y if they keep them in their cache across
>the execution of both instructions (otherwise, the load
>reordering would be architecturally visible, which is
>against 2.1).
>

I disagree. Rule 2.1 insists on "program order", but there is no program order between the (xchg [_x], ...) on Processor 0 and the (xchg [_y], ...) on Processor 1; they execute on different processors.

>Think of _x and _y as being cachelines, and the locked
>ops as being "guaranteed sequence point for that cacheline",
>simply because it's owned exclusively by that CPU for
>that one whole instruction.
>
>So in order for the illegal case in 2.7 to happen (r3=1,
>r4=0, r5=1, r6=0), we know that the movement for cacheline
>_x must have been: P3x -> P0x -> P2x. And similarly for _y
>it must have been P2y -> P1y ->P3y. Agreed?
>

No.
First, you assume a simple MESI or MESI-like implementation. For x86 that has been the case so far, but I don't think it is architected.
Second, even under MESI, in the initial state line _x could be shared between P0 and P3, or between P0, P2 and P3. The same goes for line _y.

>That's all possible, because we're talking about two
>different cachelines, and they can obviously move around
>their ownership independently of each other.
>
>But now, look at P2: because of the load order requirement,
>we know that P2 must have held _x before it held _y,
>because otherwise the load ordering (rule 2.1) is no longer
>valid. So we know that P2x -> P2y. Similarly, for the same
>reasons, we know that P3y -> P3x.
>

No. The 2.1 load ordering applies only to locations modified by the same processor.

>And this is where x86 is different from your regular weakly
>ordered read-rule CPU. If you allow reads to be visibly
>reordered wrt other reads, those P2x -> P2y and P3y -> P3x
>guarantees do not exist, and the 2.7 guarantee doesn't hold.
>

Same as above. In the absence of locks, load ordering doesn't apply to the case of three or more processors.

>But on x86 you do know that either P2 held both cachelines
>over both instructions or it did the P2y fill later than P2x

In MESI with broadcast invalidates, yes. But that sort of implementation is not architected.
I can imagine an implementation that satisfies 2.1 but allows out-of-order issue of external load requests, especially taking into account that P1 and P3 (as well as P0 and P2) could be two threads of the same core that share all levels of the cache hierarchy.

>(and the same for P3), because anything else would violate
>the local load ordering constraints of 2.1 and the CPU could
>be architecturally seen to re-order loads.
>
>(Note the importance of "architecturally". P2 can
>do load re-ordering internally, but only if it holds on to
>the cachelines in question across the whole sequence, which
>makes the reordering architecturally invisible since
>clearly no other CPU could have modified those values
>and shown an ordering violation!)
>
>So now you have an impossible sequence. There's no way
>we can satisfy all of
>
>P3x -> P0x -> P2x (cacheline _x movement order)
>P2x -> P2y (CPU2 cacheline load order)
>P2y -> P1y -> P3y (cacheline _y movement order)
>P3y -> P3x (CPU3 cacheline load order)
>
>because that would be a circular dependency (P3x -> .. ->
>P3x).
>
>And notice how all these things are "local" decisions. All
>of those ordering points are decisions about a single
>cacheline state change at a single CPU, there's no "global
>ordering" required for the above argument, every point is
>just a local decision.
>
>But yeah, memory ordering is really easy to get wrong, so
>maybe I have a thinko there somewhere. But basically I
>claim that it falls out of just the cache coherency rules
>and the new stricter load ordering that Intel guarantees.
>
>Linus

My head is hurting, and I'm not speaking figuratively :( It's probably a sign that I should leave this discussion to more capable (and preferably younger) minds.
