CMPXCHG - all or nothing

By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), March 19, 2008 5:21 pm
Room: Moderated Discussions
Michael S (already5chosen@yahoo.com) on 3/19/08 wrote:
>
>According to my understanding, the main problem with x86
>synchronization primitives is that programmer is provided
>with either very lightweight or very heavyweight options
>when it would be nice to have something in the middle.

That's somewhat true.

I say "somewhat", because it's definitely the case that
the non-locking ones (which are only interrupt-safe, not
thread-safe) are really cheap, and the locking ones are
very expensive in comparison to the cheap ones.

BUT - and this is a big but - even the locking ones are
not actually really very expensive compared to many other
architectures doing the same things using other methods.

As already mentioned, LL/SC can be much more
expensive than a locked x86 instruction has ever been if
you just do it wrong - and it definitely has been done wrong.

It all comes down to an "implementation issue", and if there
is one thing that x86 shines at, it's all the engineering
that eventually gets done on all those implementation
issues.

>The heavyweight option (CMPXCHG with lock) guarantees
>both MP atomicity and "a total order" relatively to all
>other locked instructions, even at totally unrelated
>memory locations.

Yes and no.

It guarantees it from a software standpoint (the same
way Java guarantees a strongly ordered memory model), but
that by no means means that the hardware has to serialize
things completely.

Also, the x86 memory model is not a machine-wide
complete ordering - the memory model is called "processor
ordering", and it means that each CPU sees the stores of
other individual CPU's in a well-defined order.

See the Intel memory ordering whitepaper for all the gritty
details (just google for that phrase, the first hit will
be it).

>That's o.k. for 2-socket machine or for tightly-coupled
>4-socket, but I would imagine that on Altix-sized
>directory-based NUMA implementing the total order costs
>you more than microsecond. Or, may be, I overlooked some
>neat trick?

You overlooked two things, one trivial and one clever.

The trivial one is that the actual locked operations are
just done in the cache, and imply no actual barrier
semantics wrt other CPU's whatsoever. They are only
serializing within that local CPU, and they absolutely
do not get more expensive with more CPU's.

(Of course, with lots of CPU's, it's more likely that you
will have cache contention on locks etc, and the cacheline
may be more likely to be on another CPU, so in that sense
it can and does get slower, but that's not an instruction
or ISA issue, that's just a generic property of any
shared access).

So the fact that a locked instruction is serializing only
matters for that one CPU - it does mean that the
trivial (stupid) solution is to make sure that the
instruction pipeline is flushed and all preceding writes
have moved from the write buffers into the cache.

This is why a locked instruction is pretty expensive on a
P4 (but still cheaper than LL/SC was on early alphas!): the
netburst architecture did that stupid thing with basically
a full pipeline flush and the pipelines were long.

But on many other x86 implementations a locked instruction
is literally just on the order of a couple of tens of
cycles, and the P4 really is the odd man out (not just in
this area - there's a lot of ops that it takes hundreds of
cycles at because it hits some random rough patch).

I think the AMD K8 does a locked cycle in 12 cycles in
many cases, for example, and while Core 2 takes more, it's
not orders of magnitude more, it's something like thirty
cycles, iirc. So we're still talking much much less than
a cache miss.

(Of course, locking does also get lots of cache
misses in many loads, but again, that's not an ISA issue,
that's just the nature of locking. There are other loads
where contention and cache misses almost never happen,
but you need to do the locking just because you have to
allow for the fact that they might happen).

However, the subtle thing you are missing is that even
that full pipeline flush is actually unnecessary. You only
need to serialize the pipeline if you can't keep track of
the ordering any other way than by just disabling everything
that might cause operations to reorder.

But if you are clever, what you do is to re-order
wildly all your operations as long as they hit in your
cache, and only care about being careful about the ordering
when it might be visible to other CPU's - when you have a
cache miss (or eviction due to a cache probe from outside).

So the x86 semantics are not actually horrible even now
(less so than others have done with "better" architectures)
and they can be made better. Which it sounds like Nehalem
is doing.

So in theory, a locked instruction should basically have
zero overhead over a non-locked one - but it does require
some more tracking resources to do that.

So to make an analogy: a conditional branch could cause
a full pipeline stall until the condition is resolved,
because the software-visible rules are that the branch
is "serializing" - you don't partially execute one side
until it's all been resolved, you do one side or the other,
very black-and-white.

But there are obviously ways to avoid that "serialization"
of the pipeline by using branch prediction, if you are just
willing to go to the effort of tracking which instructions
are done speculatively and can sort it out later if you were
wrong.

The same is true of the strictly serializing software
semantics of a locked x86 instruction. You don't have
to stall the pipeline and serialize the memory queues, if
you're just willing and able to track the ordering of all
the memory accesses you do speculatively around the locked
instruction, and can sort it out later if you did them
in the wrong order.

In practice, nobody has ever done that for x86. Yet. Let's
see how close Nehalem gets.

Linus