By: anon (anon.delete@this.anon.com), December 4, 2014 9:17 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on December 4, 2014 12:05 pm wrote:
> anon (anon.delete@this.anon.com) on December 3, 2014 5:08 pm wrote:
> >
> > We've talked about this before here, but LL/SC can guarantee progress (when it is limited
> > like POWER does), and the LL of course always carries a load-for-store signal.
>
> Both of these are "true", but not what I'm complaining about.
>
> First off, "guaranteed forward progress" in LL/SC tends to be a global thing: you're
> guaranteeing not to livelock. That is not interesting from a performance standpoint,
> it's just interesting from a "minimal requirements" standpoint.
Atomic RMW operations do not guarantee individual forward progress either. Some x86 implementations have had near-livelock or at least significant unfairness in them.
For a *constrained* LL/SC operation, ensuring forward progress and fairness is not such a problem in the core. The hard part is in the coherency protocol, and at that level there is not much difference between the two (either way the core requests the cacheline exclusive, loads from it, performs an operation, and stores the result back, without relinquishing the line).
>
> The thing is, the LL/SC model (and the load/cmpxchg model) does not guarantee that each
> thread makes forward progress, much less that each cache miss makes any progress.
A *constrained* LL/SC is much less like an (unconstrained) load and eventual cmpxchg, and much more like just the cmpxchg itself (or the fetch-and-add, or decrement-and-test, etc).
I agree that completely unconstrained LL/SC sequences are not viable, but that's not how they are used and not how they were intended to be used. They're a RISC-y approach to building the complement of basic read-modify-write operations.
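Just to make it concrete, here is a minimal sketch of such a constrained sequence -- a PowerPC-style atomic increment in GCC inline asm (hypothetical helper name, memory barriers omitted), with only a couple of fixed instructions between the load-reserve and the store-conditional:

static inline void atomic_inc_sketch(int *p)
{
        int tmp;
        __asm__ __volatile__(
                "1:     lwarx   %0,0,%2\n"      /* load word and reserve */
                "       addi    %0,%0,1\n"      /* modify in a register */
                "       stwcx.  %0,0,%2\n"      /* store only if the reservation held */
                "       bne-    1b\n"           /* reservation lost: retry */
                : "=&r" (tmp), "+m" (*p)
                : "r" (p)
                : "cc", "memory");
}

The point is that the window between the lwarx and the stwcx. is tiny and fixed, which is what makes it tractable for the hardware to treat the whole thing much like a single RMW.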
> Seriously, it's a real issue. The cmpxchg fails. Not just occasionally. Under real load
> (admittedly very high contention), it fails quite often. And each failure is basically an
> extra and unnecessary ping-pong of a cacheline, where some other CPU ended up winning a
> race, and causing the cache access on the losing CPU to be pure and utter useless work.
Yes, and that's a disadvantage of x86. It has to use load + cmpxchg to fill in the gaps of its atomic primitives, and load + cmpxchg is certainly worse than LL/SC for doing that job. Arguably the core could do some fancy detection of the load, but it's significantly harder than for a constrained LL/SC, I would say.
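To show what that gap-filling looks like, a sketch in C11 atomics (hypothetical helper; memory ordering simplified to relaxed):

#include <stdatomic.h>

static long fetch_and_op_sketch(_Atomic long *p, long (*op)(long))
{
        long old = atomic_load_explicit(p, memory_order_relaxed);
        /* Each failed compare-exchange is the wasted cacheline ping-pong
           described above: someone else modified *p between our load and
           our cmpxchg. */
        while (!atomic_compare_exchange_weak_explicit(p, &old, op(old),
                        memory_order_relaxed, memory_order_relaxed))
                ;       /* on failure, 'old' is refreshed with the current value */
        return old;
}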
>
> On x86 (which, again, is the only architecture you can actually compare these approaches),
> the numbers seem to be that if you update counters, the atomic RMW "add" model gets
> about twice the progress over a "load+add+cmpxchg" model. Twice.
That's not a valid comparison, as I said: load+cmpxchg is how x86 fills the gaps in its atomic primitives, not the equivalent of a constrained LL/SC sequence.
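For reference, the other side of the comparison Linus quotes is just a plain atomic add (same stdatomic.h setup as the sketch above), which on x86 becomes a single locked instruction with no retry path:

static void counter_inc_rmw_sketch(_Atomic long *c)
{
        /* Compiles to a single "lock add" (or "lock xadd") on x86: once the
           core owns the line, the update happens; there is no failure case. */
        atomic_fetch_add_explicit(c, 1, memory_order_relaxed);
}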
>
> And yes, that obviously ends up depending on cache coherency details etc, and how "sticky" a cacheline is
> to a CPU that got it. But that's actually a real potential issue too: being too sticky tends to improve
> throughput, but can cause some seriously excessive unfairness issues, where a node or a core that got exclusive
> access to the cacheline and keeps writing to it can get very unfair advantages wrt other cores.
>
> And yes, we've very much seen that too especially in NUMA environments.
>
> So LL/SC either fails a lot and causes ping-pong traffic while only making very slow
> progress (some progress, yes), or can try to avoid the failure case by making the
> cachelines very sticky and then become very unfair and amenable to imbalance.
That's simply not the case on decent implementations (like POWERx). They won't fail the SC a lot.
>
> A RMW model doesn't tend to have the same kind of issues. Once you got the cacheline,
> you will update it. There is no failure case and unnecessary cache transaction.
>
> And yes, you can in theory make LL/SC or load/cmpxchg work like a RMW by noticing the pattern and basically
> turning the bad LL/SC model into an RMW model by generating macro-instructions. And I actually think it's not
> a bad idea. But even if you do that, you basically have to first admit the superiority of the RMW model.
For constrained LL/SC, I believe that is what POWER CPUs do. I should stop repeating this because I still haven't been able to find the source of my claim, but I'm fairly sure I have read it in IBM publications.
I think it would be relatively easy to implement with an optimistic first pass -- try to hold the cache line exclusive and defer exceptions for a small window after the LL (enough to avoid invalidation in 99.9x% of cases where you have a constrained LL/SC sequence), and then enter a slower path if the SC fails, which would do more work to guarantee completion.
> As to the LL always implying write intent, I agree that it tends to make more sense.
> I'm not actually convinced everybody always does that. In the x86 world, where the
> pseudo-equivalent sequence is load/cmpxchg, we definitely have hit that issue.
Well "everybody always does" is not very useful. I'm sure some implementations have really shit atomic RMW instructions too.
IBM does load exclusive with LL of course. They've been improving it too, with the "exclusive access" hint in the LL.
"Atomic Update (EH=0)
This hint indicates that the program is using a fetch and
operate (e.g., fetch and add) or some similar algorithm
and that all programs accessing the shared variable are
likely to use a similar operation to access the shared
variable for some time.

Exclusive Access (EH=1)
This hint indicates that the program is attempting to
acquire a lock and if it succeeds, will perform another
store to the lock variable (releasing the lock) before
another program attempts to modify the lock variable."
Not that this is any inherent advantage of the LL/SC style (such hints could be put in atomic RMW instructions). Just that IBM is taking this stuff seriously, so they'll certainly have picked up such low-hanging fruit as load-for-store.
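As a sketch of how that hint shows up at the instruction level, a lock acquire might look like the following (hypothetical, barriers kept minimal and backoff omitted; newer assemblers accept EH as an extra lwarx operand, older ones have to encode it by hand):

static inline void spin_lock_sketch(unsigned int *lock)
{
        unsigned int tmp;
        __asm__ __volatile__(
                "1:     lwarx   %0,0,%2,1\n"    /* load-reserve, EH=1: we will store again (the unlock) soon */
                "       cmpwi   0,%0,0\n"       /* already held? */
                "       bne-    1b\n"           /* yes: spin (no backoff in this sketch) */
                "       stwcx.  %1,0,%2\n"      /* try to take the lock */
                "       bne-    1b\n"           /* reservation lost: retry */
                "       isync\n"                /* acquire barrier */
                : "=&r" (tmp)
                : "r" (1), "r" (lock)
                : "cc", "memory");
}

A fetch-and-add style user would use EH=0 instead, telling the hardware that other cores are likely to want to do the same kind of update soon.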