By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), December 4, 2014 12:05 pm
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on December 3, 2014 5:08 pm wrote:
>
> We've talked about this before here, but LL/SC can guarantee progress (when it is limited
> like POWER does), and the LL of course always carries a load-for-store signal.
Both of these are "true", but not what I'm complaining about.
First off, "guaranteed forward progress" in LL/SC tends to be a global thing: you're guaranteeing not to livelock. That is not interesting from a performance standpoint; it's just interesting from a "minimal requirements" standpoint.
The thing is, the LL/SC model (and the load/cmpxchg model) does not guarantee that each thread makes forward progress, much less that each cache miss makes any progress.
Seriously, it's a real issue. The cmpxchg fails. Not just occasionally. Under real load (admittedly very high contention), it fails quite often. And each failure is basically an extra, unnecessary ping-pong of a cacheline, where some other CPU ended up winning the race, and the cache access on the losing CPU turned out to be pure, utterly useless work.
On x86 (which, again, is the only architecture where you can actually compare these approaches), the numbers seem to be that if you update counters, the atomic RMW "add" model makes about twice the progress of a "load+add+cmpxchg" model. Twice.
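To make the comparison concrete, here is roughly what the two counter-update styles look like, written with the GCC/Clang __atomic builtins (an illustrative sketch, not code from any particular project):

    #include <stdint.h>

    /* Atomic RMW: on x86 this compiles down to a single "lock add".
     * Once the CPU owns the cacheline, the update cannot fail. */
    static void counter_add_rmw(uint64_t *ctr)
    {
            __atomic_fetch_add(ctr, 1, __ATOMIC_RELAXED);
    }

    /* Load + add + cmpxchg: every failed compare-exchange means the CPU
     * acquired the cacheline, did the work, and then threw it away. */
    static void counter_add_cas(uint64_t *ctr)
    {
            uint64_t old = __atomic_load_n(ctr, __ATOMIC_RELAXED);

            /* on failure, 'old' is refreshed with the current value */
            while (!__atomic_compare_exchange_n(ctr, &old, old + 1, 0,
                                                __ATOMIC_RELAXED,
                                                __ATOMIC_RELAXED))
                    ;
    }

Under heavy contention, the retry in the second version is exactly where the extra ping-pong comes from.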
And yes, that obviously ends up depending on cache coherency details etc, and how "sticky" a cacheline is to a CPU that got it. But that's actually a real potential issue too: being too sticky tends to improve throughput, but can cause some seriously excessive unfairness issues, where a node or a core that got exclusive access to the cacheline and keeps writing to it can get very unfair advantages wrt other cores.
And yes, we've very much seen that too, especially in NUMA environments.
So LL/SC either fails a lot and causes ping-pong traffic while making only very slow progress (some progress, yes), or it tries to avoid the failure case by making the cachelines very sticky and then becomes very unfair and prone to imbalance.
An RMW model doesn't tend to have the same kind of issues. Once you've got the cacheline, you will update it. There is no failure case and no unnecessary cache transaction.
And yes, you can in theory make LL/SC or load/cmpxchg work like an RMW by noticing the pattern and basically fusing the bad LL/SC sequence into an RMW macro-instruction. And I actually think it's not a bad idea. But even if you do that, you basically have to first admit the superiority of the RMW model.
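For illustration, the shape the hardware would have to recognize is completely regular; something like this, where load_reserve/store_conditional are made-up helpers standing in for LL and SC (emulated with a CAS here only so the sketch is self-contained, they are not real intrinsics on any architecture):

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical stand-ins for LL and SC, emulated with a CAS purely
     * so this compiles anywhere. */
    static uint64_t load_reserve(uint64_t *p)
    {
            return __atomic_load_n(p, __ATOMIC_RELAXED);
    }

    static bool store_conditional(uint64_t *p, uint64_t expected, uint64_t desired)
    {
            return __atomic_compare_exchange_n(p, &expected, desired, false,
                                               __ATOMIC_RELAXED,
                                               __ATOMIC_RELAXED);
    }

    /* The whole loop computes nothing but "*p += 1": a fixed
     * LL / ALU op / SC shape that a front-end could recognize and issue
     * as a single atomic add to the memory system. */
    static void ll_sc_add(uint64_t *p)
    {
            uint64_t old, new_val;

            do {
                    old = load_reserve(p);                 /* LL */
                    new_val = old + 1;                     /* the op in between */
            } while (!store_conditional(p, old, new_val)); /* SC */
    }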
As to the LL always implying write intent, I agree that it tends to make more sense. I'm not actually convinced everybody always does that, though. In the x86 world, where the pseudo-equivalent sequence is load/cmpxchg, we have definitely hit that issue: the plain load pulls the cacheline in shared, and the cmpxchg then has to turn around and get it exclusive.
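One partial mitigation on the x86 side is to put an explicit write-intent hint in front of the load; a minimal sketch with the generic prefetch builtin (illustrative only, and whether it actually helps depends entirely on the microarchitecture):

    #include <stdint.h>

    static void counter_add_cas_prefetchw(uint64_t *ctr)
    {
            /* __builtin_prefetch(p, 1) is the GCC/Clang "prefetch for
             * write" hint: the hope is that the line arrives exclusive,
             * so the cmpxchg doesn't need a second coherency transaction. */
            __builtin_prefetch(ctr, 1);

            uint64_t old = __atomic_load_n(ctr, __ATOMIC_RELAXED);

            while (!__atomic_compare_exchange_n(ctr, &old, old + 1, 0,
                                                __ATOMIC_RELAXED,
                                                __ATOMIC_RELAXED))
                    ;
    }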
Linus