CMPXCHG latency

By: Linus Torvalds (, April 1, 2008 8:25 am
Room: Moderated Discussions
Aaron Spink ( on 3/31/08 wrote:
>So the question becomes, if you could design your own
>hardware based locking interface/functionality what would
>it look like? CMPXCHG? Some sort of advanced multiple
>LL/SC/transactional memory?

Heh. If I could dream, I'd go for:

- atomic, small, fast and generic arithmetic atomic
instructions. x86 has this already, of course.

Notice how this isn't about locking, this is just about
other atomic ops.

This is because a lot of the atomic stuff is just to
do things like setting bits and incrementing counts, and
function call interfaces are not good for this. A simple
increment or decrement is much better in-line.

The one thing I wish x86 did was to not force that hard
serialization. Right now atomic counter updates are more
expensive than they need to be.

Thus my hopes for Nehalem.

- I don't believe in lockless algorithms outside of niche
areas, so I don't really believe in DCAS and friends.

- For locking, as for a lot of other things, I'd actually
prefer higher-level "CISCy" instructions rather than any
architecturally visible low-level uarch cleverness.

Why? Because I'm a big believer in OoO, I also think
that being too low-level is actually bad for performance.
I'd much rather take a "memset()" instruction and let
the CPU do what's right for that cache and uarch than
to have "nontemporal stores" and have software try to be
clever. The same is true of locking.

For example, you can do a lot of mental masturbation
about lockless doubly-linked lists, but I'd personally
much rather see an architecture make sure that

list_add(entry, list);

was really fast.

So in the code-example above, one of the issues is that
a lot of lockless algorithms may make it easy to add an
element to a list or something like that, but a lot of
real code is much easier if you also can do something
generic (like incrementing a count or whatever) that is
also guaranteed to be atomic wrt the list itself.

Now, x86 is not optimal at the above, but certainly
neither is some super-LL/SC. Transactions? I think that
they can be optimal as an implementation strategy, but they
are too close to the hw to be a good abstraction.

So what's the problem with x86? The thing is that the x86
locked semantics do require that exclusive cache ownership
of the lock (because the lock is always a modifying store,
and the semantics of locked instructions require that they
are fully serialized wrt other stores).

I'd like architectural spinlocks that basically give the
performance characteristics of a lockless algorithm, but
the semantics of locks. And I'd like it to happen
without having to be excessively fine-grained about the
locks themselves (the kernel is good about this, but we've
spent years - I don't think it's realistic to do locking
that fine-grained in a lot of other areas, and if people
really want parallel programming, they'd better realize

And yeah, you can tell me I'm asking for the moon, and yes,
you may be right, but the thing is, I think it's possible
to do, or at least come close.

I think you can get very very close with transactional
memory, but I really believe that transactional memory as
an exposed thing in the ISA is a huge mistake (except for
perhaps embedded CPU's etc). So I think transactions are
a good way to implement spinlocks, but at a uarch
level, not software-visible.

In particular, I'd like CPU's to literally be able to
share the cacheline that contains the lock, and not
bounce it between them. Then, the rules (if the CPU
micro-architecture does transactions) could be something

- trylock expands to
- return success (and start transaction atomically)
if lock location is clear
- OR just do an atomic "test-and-set-bit"

- transaction failure:
- roll back and force the test-and-set-bit version

- spinunlock:
- transaction close, or if the transaction failed,
do the traditional "clear the spinlock".

but I'd want this to
(a) if the uarch doesn't do transactions, just do the
locking as traditional atomic memory ops (ie don't
force sw to care about the microarchitecture).
(b) fail transactions and nest them invisibly (ie if you
only do a single level of transactions, fail the outer
one when you hit the inner one automatically, and fall
back and do just the inner one as a transaction)
(c) doing it well probably requires a "lock prediction
buffer", the same as branches. Because some trylocks
will always fail (because they overflow whatever
transaction limits you have - which should not
be architecturally visible!)
(d) the lock prediction in turn means that this probably
needs to be inlined to work well (so that you can do
the prediction based on instruction pointer). Which is
just another argument against making this a software
visible choice, or requiring anything but a simple
small single instruction for the 'trylock'/'unlock'.

In other words, I'd literally like to see just two new
instructions: "trylock" and "unlock". No visible
transactional memory. No visible lockless behavior.
Just an architected and fast trylock/unlock, nothing more
(yeah, you probably want a "islocked" instruction for
testing, since the software isn't supposed to read the
lock location to find out).

Just design it so that the hardware can do something
clever if it wants to, and so that it's invisible to sw
whether the hardware is clever or not. And sw can treat that
"trylock" literally as just an odd way of doing a normal
"test-and-set-bit" instruction, with the exact semantics
you'd see for a lock acquire.

And the thing is, I don't think you can reasonably do the
above with x86 locked instructions. The ordering constraints
of two locked instructions are such that I cannot see how
you can do it with the memory location no bouncing between

So even if the CPU can see the whole instruction sequence
between the locked instruction and the unlock (which is a
simple store to the lock word), and even if the CPU sees
that the final store (the unlock) will restore the same
value in the lock word that it had before the locked insn,
even then you cannot actually avoid getting the cacheline
exclusively, because you couldn't enforce the required x86
ordering without it.

IOW, I think the x86 model works really well, but I'd like
to see something that works even better, and avoids the
lock bouncing back and forth if the data cachelines don't
need to bounce back and forth (which I think is possible,
especially if you don't have finegrained locking: two
threads may want the same lock, but then touch totally
independent cachelines inside that "shared" locked region!)

But I don't actually have any numbers to back any of the
above up. It may be that in all real loads, the lock
bouncing either doesn't happen (because there is no real
cross-cpu contention anyway), or it is inevitable (because
the transactions will conflict anyway and the above will
just degenerate to the traditional spinlock mechanism in
all cases).

But I believe that:

- lockless algorithms are too hard for anything but
special cases and/or trivial stuff. Yes, it's good for
some really simple queuing etc. It's no good enough in

- locks do work, and with the right semantics, I think
you can get d*mn close to lockless performance.

- I suspect that if Nehalem does x86 atomics pipelined,
it will already basically be pretty optimal when the
lock itself does not bounce.

- something like the above suggestion might work
even when there is contention on the lock itself, but
not the data structures (which I believe to be a common
case when people haven't spent years to try to make
locking very fine-grained)

I also believe that I am an opinionated bastard, and may be
totally and utterly full of sh*t. Because I cannot actually
back my theory up with any real numbers ;)

I'm actually perfectly happy with x86 locking, and I'll
hopefully be happier still with Nehalem. Most of the loads
I care about personally are actually pretty dang linear and
have no cacheline bouncing what-so-ever.

And I have to admit that the strong x86 ordering semantics
of locked instructions means that they do have guarantees
that the above non-ordered ones wouldn't have (ie if you
have a uarch that does the transaction-based one with a
lock that can stay sharedm it would literally mean that you
would have multiple CPU's modifying different cachelines at
the same time even if they were inside the same lock).

I wonder how painful that might be, and how many bugs that
weaker lock ordering could expose. It might not be worth it
just for validation reasons.

I could also imagine that there is some reason why it's
a totally broken idea to start with. I'm not a hw engineer,
nor have I thought really seriously about transactional
memory, because it simply isn't a real-world concern today.

< Previous Post in ThreadNext Post in Thread >
TopicPosted ByDate
Nehalem Architecture: Improvements Detailed Blut Aus Nord2008/03/17 02:52 PM
  Nehalem Architecture: Improvements Detailed bah2008/03/17 04:45 PM
    Nehalem Architecture: Improvements Detailed Linus Torvalds2008/03/17 06:14 PM
      Nehalem Architecture: Improvements Detailed Gabriele Svelto2008/03/18 01:11 AM
        Nehalem Architecture: Improvements Detailed Henrik S2008/03/18 04:23 AM
        Nehalem Architecture: Improvements Detailed Doug Siebert2008/03/18 09:48 PM
          Nehalem Architecture: Improvements Detailed anon2008/03/18 10:37 PM
            Nehalem Architecture: Improvements Detailed Doug Siebert2008/03/19 05:23 PM
          Nehalem Architecture: Improvements Detailed Ian Ollmann2008/03/19 08:15 AM
            SSE 4.2 Michael S2008/03/19 04:13 PM
              SSE 4.2 Ian Ollmann2008/03/20 09:56 AM
              SSE 4.2 anonymous2008/03/20 12:29 PM
                SSE 4.2 David W. Hess2008/03/21 07:24 AM
                  SSE 4.2 anonymous2008/03/22 07:27 AM
      CMPXCHG latencyDavid Kanter2008/03/28 05:59 PM
        CMPXCHG latencyanonymous coward2008/03/28 10:24 PM
          CMPXCHG latencyDavid Kanter2008/03/28 10:26 PM
            CMPXCHG latencyLinus Torvalds2008/03/29 11:43 AM
              CMPXCHG latencyDavid W. Hess2008/03/29 11:56 AM
              CMPXCHG latencyLinus Torvalds2008/03/29 02:17 PM
                CMPXCHG latencyGabriele Svelto2008/03/31 12:25 AM
                  CMPXCHG latencyMichael S2008/03/31 12:38 AM
                    CMPXCHG latencynick2008/03/31 12:52 AM
                      CMPXCHG latencyMichael S2008/03/31 01:51 AM
                        CMPXCHG latencyGabriele Svelto2008/03/31 02:08 AM
                        CMPXCHG latencynick2008/03/31 07:20 PM
                          CMPXCHG latencyMichael S2008/04/01 01:14 AM
                            CMPXCHG latencynick2008/04/01 02:34 AM
                    CMPXCHG latencyLinus Torvalds2008/03/31 10:16 AM
                      CMPXCHG latencyAaron Spink2008/03/31 07:15 PM
                        CMPXCHG latencynick2008/03/31 07:34 PM
                        CMPXCHG latencyLinus Torvalds2008/04/01 08:25 AM
                          CMPXCHG latencyZan2008/04/01 09:54 PM
                          CMPXCHG latencyZan2008/04/02 12:11 AM
                            CMPXCHG latencyLinus Torvalds2008/04/02 08:04 AM
                              CMPXCHG latencyZan2008/04/02 11:02 AM
                                CMPXCHG latencyLinus Torvalds2008/04/02 12:02 PM
                                  CMPXCHG latencyZan2008/04/02 04:15 PM
                      CMPXCHG latencyMichael S2008/04/01 01:26 AM
                        CMPXCHG latencyLinus Torvalds2008/04/01 07:08 AM
                CMPXCHG latency - Intel sourceWouter Tinus2008/04/02 12:36 PM
                  CMPXCHG latency - Intel sourceLinus Torvalds2008/04/02 02:21 PM
                    CMPXCHG latency - Intel sourceDavid Kanter2008/04/02 02:39 PM
    Nehalem Architecture: Improvements Detailed Philip Honermann2008/03/19 01:11 PM
      Nehalem Architecture: Improvements Detailed Linus Torvalds2008/03/19 01:43 PM
        CMPXCHG - all or nothingMichael S2008/03/19 03:49 PM
          multithreading - all or nothingno@thanks.com2008/03/19 05:17 PM
          CMPXCHG - all or nothingLinus Torvalds2008/03/19 05:21 PM
            CMPXCHG - all or nothingMichael S2008/03/20 06:38 AM
              CMPXCHG - all or nothingLinus Torvalds2008/03/20 08:45 AM
                CMPXCHG - all or nothingMichael S2008/03/21 07:08 AM
                  CMPXCHG - all or nothingLinus Torvalds2008/03/21 08:47 AM
            CMPXCHG - all or nothingHenrik S2008/03/20 10:09 AM
              CMPXCHG - all or nothingLinus Torvalds2008/03/20 10:53 AM
                CMPXCHG - all or nothingHenrik S2008/03/20 12:03 PM
                  CMPXCHG - all or nothingLinus Torvalds2008/03/20 01:12 PM
                    CMPXCHG - all or nothingHenrik S2008/03/21 12:13 AM
                      CMPXCHG - all or nothingGabriele Svelto2008/03/21 01:22 AM
        Nehalem Architecture: Improvements Detailed Philip Honermann2008/03/19 06:28 PM
          Nehalem Architecture: Improvements Detailed Linus Torvalds2008/03/19 07:42 PM
            Nehalem Architecture: Improvements Detailed Philip Honermann2008/03/20 06:03 PM
              Nehalem Architecture: Improvements Detailed Linus Torvalds2008/03/20 06:33 PM
                Nehalem Architecture: Improvements Detailed Philip Honermann2008/03/25 06:37 AM
                  Nehalem Architecture: Improvements Detailed Linus Torvalds2008/03/25 08:52 AM
                    What is DCAS? (NT)David Kanter2008/03/25 10:13 AM
                      Double compare-and-exchangeHenrik S2008/03/25 10:57 AM
                        Double compare-and-exchangeLinus Torvalds2008/03/25 11:38 AM
                          Double compare-and-exchangesavantu2008/03/25 01:54 PM
                            Double compare-and-exchangeLinus Torvalds2008/03/25 04:09 PM
                              Double compare-and-exchangeJamie Lucier2008/03/25 08:55 PM
                                Double compare-and-exchangesavantu2008/03/25 09:15 PM
                                  Double compare-and-exchangeHenrik S2008/03/26 08:40 AM
                                    Double compare-and-exchangeArun Ramakrishnan2008/03/27 02:07 AM
                                      Double compare-and-exchangeHenrik S2008/03/27 04:45 AM
                                  Surely GPL applies ?Richard Cownie2008/03/26 10:05 AM
                                    Surely GPL applies ?anon2008/03/26 02:58 PM
                                    Surely GPL applies ?Paul2008/03/26 05:01 PM
                                Double compare-and-exchangesomeone2008/03/25 09:18 PM
                                  Double compare-and-exchangeArun Ramakrishnan2008/03/27 02:03 AM
                                    Double compare-and-exchangesavantu2008/03/27 03:01 AM
                                      Double compare-and-exchangeArun Ramakrishnan2008/03/30 09:09 AM
                                        Double compare-and-exchangesavantu2008/03/30 09:59 AM
                                Double compare-and-exchangeLinus Torvalds2008/03/26 10:50 AM
                                  Double compare-and-exchangeanon2008/03/26 04:47 PM
                                  Double compare-and-exchangePaul2008/03/26 05:07 PM
                          Double compare-and-exchangeHoward Chu2008/03/25 05:18 PM
  Nehalem Architecture: Improvements Detailed Mr. Camel2008/03/17 08:50 PM
    Nehalem Architecture: Improvements Detailed anonymous2008/03/17 09:20 PM
  TFP will finally come :-)Paul A. Clayton2008/03/18 12:56 PM
  Nehalem Architecture: Improvements Detailed IntelUser20002008/03/27 07:46 PM
    Nehalem Architecture: Improvements Detailed David Kanter2008/03/27 10:21 PM
      Nehalem Architecture: Improvements Detailed nick2008/03/27 11:06 PM
        Nehalem Architecture: Improvements Detailed David Kanter2008/03/28 02:45 PM
          Nehalem Architecture: Improvements Detailed nick2008/03/28 07:52 PM
  L1 I-cachepuzzled2008/04/01 07:53 AM
    L1 I-cacheS. Rao2008/04/01 09:47 AM
    L1 I-cacherwessel2008/04/01 12:23 PM
    L1 I-cacheGabriele Svelto2008/04/03 12:30 AM
Reply to this Topic
Body: No Text
How do you spell tangerine? 🍊