By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), March 20, 2008 8:45 am
Room: Moderated Discussions
Michael S (already5chosen@yahoo.com) on 3/20/08 wrote:
>
>I did read Intel memory ordering whitepaper before posting
>the original post. It clearly states that locked operations
>from different agents are ordered in absolute machine-wide
>order (i.e. seen in the same order by all observers) rather
>than in "processor order" like the rest of the WB stores.
Ahh, I think that is just a natural outcome of all the
other rules - ie loads are not re-ordered wrt each other,
and stores are also guaranteed to be seen in the same order.
Couple that with the fact that locked instructions also
disable some of the reordering that is allowed (ie
the local store-to-load forwarding and loads passing
stores), and you end up with total ordering for the locked
ops.
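As a rough model of why that total ordering falls out: a locked op owns the cacheline exclusively for its entire read-modify-write, so the order in which CPUs win that exclusive ownership is itself a single machine-wide order. A minimal sketch (my own analogy, not actual hardware - a Python lock stands in for exclusive cacheline ownership):

```python
import threading

counter = 0
order = []               # the single global order of "locked" ops
lock = threading.Lock()  # models exclusive cacheline ownership

def locked_inc(cpu, n):
    """Each iteration is one locked RMW: the line is held
    exclusively for both the read and the write."""
    global counter
    for _ in range(n):
        with lock:
            counter += 1
            order.append(cpu)

threads = [threading.Thread(target=locked_inc, args=(cpu, 1000))
           for cpu in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# No lost updates, and there is exactly one total order of ops
# that every observer would agree on (the lock acquisition order).
assert counter == 4000
assert len(order) == 4000
```

The point of the analogy is only that exclusive-for-the-whole-op access is enough to induce a total order; it says nothing about the cheaper orderings of ordinary stores.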
>I'm too stupid to figure out whether semantics like these
>constrain possible implementations of big hierarchical
>directory based NUMA machines.
Hey, being too stupid to understand all the implications of
memory ordering means that pretty exactly 100% of humanity
is too stupid. Those things are really really subtle, but I
do think that the 2.7 rule is not actually fundamental, but
springs directly from the previous rules together with
the fact that atomic ops hold on to the cacheline.
And no, none of it needs non-local information. Basically,
here's the rule:
- in order for a CPU to do a locked op, it has to have
that cacheline exclusively and continuously for the read
and write (the "continuously" part is what differs from a
non-locked access, which can split the read and write: the
read could be satisfied earlier, and the cacheline then
lost and regained before the write).
- in order for another CPU to see the result, it obviously
will have to have that cacheline too.
- now, if you track the cacheline ownership for that 2.7
case, the disallowed case simply cannot happen.
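For reference, the 2.7 test is: P0 stores _x, P1 stores _y, P2 loads _x then _y, P3 loads _y then _x. The cacheline argument above forces a single global order on those six operations, and a brute-force sketch (variable names are mine) confirms that no such order can produce the disallowed result:

```python
from itertools import permutations

# Each op is (kind, variable, destination register).
P0 = [('st', 'x', None)]
P1 = [('st', 'y', None)]
P2 = [('ld', 'x', 'r3'), ('ld', 'y', 'r4')]
P3 = [('ld', 'y', 'r5'), ('ld', 'x', 'r6')]

# Tag each op with (processor, index-in-program) so we can
# filter permutations down to ones preserving program order.
ops = [(p, i, op) for p, prog in enumerate((P0, P1, P2, P3))
                  for i, op in enumerate(prog)]

def in_program_order(perm):
    last = {}
    for p, i, _ in perm:
        if last.get(p, -1) > i:
            return False
        last[p] = i
    return True

results = set()
for perm in permutations(ops):
    if not in_program_order(perm):
        continue
    mem = {'x': 0, 'y': 0}
    regs = {}
    for _, _, (kind, var, reg) in perm:
        if kind == 'st':
            mem[var] = 1
        else:
            regs[reg] = mem[var]
    results.add((regs['r3'], regs['r4'], regs['r5'], regs['r6']))

assert (1, 0, 1, 0) not in results  # the disallowed 2.7 outcome
assert (1, 1, 1, 1) in results      # sanity: allowed outcomes do appear
```

This only checks the single-global-order case, of course - the interesting part of the argument is exactly why the cacheline-movement rules collapse this particular test down to that case.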
Why? Because processors 2 and 3 can only reorder their
reads to _x and _y if they keep them in their cache across
the execution of both instructions (otherwise, the load
reordering would be architecturally visible, which is
against 2.1).
Think of _x and _y as being cachelines, and the locked
ops as being a "guaranteed sequence point" for that
cacheline, simply because it's owned exclusively by that
CPU for that one whole instruction.
So in order for the illegal case in 2.7 to happen (r3=1,
r4=0, r5=1, r6=0), we know that the movement for cacheline
_x must have been: P3x -> P0x -> P2x. And similarly for _y
it must have been P2y -> P1y -> P3y. Agreed?
That's all possible, because we're talking about two
different cachelines, and they can obviously move around
their ownership independently of each other.
But now, look at P2: because of the load order requirement,
we know that P2 must have held _x before it held _y,
because otherwise the load ordering (rule 2.1) is no longer
valid. So we know that P2x -> P2y. Similarly, for the same
reasons, we know that P3y -> P3x.
And this is where x86 is different from your regular weakly
ordered read-rule CPU. If you allow reads to be visibly
reordered wrt other reads, those P2x -> P2y and P3y -> P3x
guarantees do not exist, and the 2.7 guarantee doesn't hold.
But on x86 you do know that either P2 held both cachelines
over both instructions or it did the P2y fill later than P2x
(and the same for P3), because anything else would violate
the local load ordering constraints of 2.1 and the CPU could
be architecturally seen to re-order loads.
(Note the importance of "architecturally". P2 can
do load re-ordering internally, but only if it holds on to
the cachelines in question across the whole sequence, which
makes the reordering architecturally invisible since
clearly no other CPU could have modified those values
and shown an ordering violation!)
So now you have an impossible sequence. There's no way
we can satisfy all of
P3x -> P0x -> P2x (cacheline _x movement order)
P2x -> P2y (CPU2 cacheline load order)
P2y -> P1y -> P3y (cacheline _y movement order)
P3y -> P3x (CPU3 cacheline load order)
because that would be a circular dependency (P3x -> .. ->
P3x).
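The circularity can be checked mechanically. Writing the four ordering facts above as directed edges between per-CPU cacheline events (names as in the post: "P3x" is P3's acquisition of cacheline _x), a plain depth-first search finds the cycle:

```python
# The four ordering constraints derived above, as directed edges.
edges = [
    ('P3x', 'P0x'), ('P0x', 'P2x'),   # cacheline _x movement order
    ('P2x', 'P2y'),                   # CPU2 load order (rule 2.1)
    ('P2y', 'P1y'), ('P1y', 'P3y'),   # cacheline _y movement order
    ('P3y', 'P3x'),                   # CPU3 load order (rule 2.1)
]

def has_cycle(edges):
    """Depth-first search for a back edge in the constraint graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)

    def visit(node, path):
        if node in path:
            return True
        return any(visit(n, path | {node}) for n in graph.get(node, []))

    return any(visit(n, set()) for n in graph)

# P3x -> P0x -> P2x -> P2y -> P1y -> P3y -> P3x: unsatisfiable.
assert has_cycle(edges)
```

Since the constraints form a cycle, no actual execution can satisfy all of them at once, which is exactly the claim that the 2.7 outcome is impossible.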
And notice how all these things are "local" decisions. All
of those ordering points are decisions about a single
cacheline state change at a single CPU, there's no "global
ordering" required for the above argument, every point is
just a local decision.
But yeah, memory ordering is really easy to get wrong, so
maybe I have a thinko there somewhere. But basically I
claim that it falls out of just the cache coherency rules
and the new stricter load ordering that Intel guarantees.
Linus