By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), August 22, 2022 11:32 am
Room: Moderated Discussions
--- (---.delete@this.redheron.com) on August 21, 2022 9:33 pm wrote:
>
> Just so I understand, Linus, are you saying that
> (a) You NEED LL4/SC4 for correctness? That is, if you can trust LL4/SC4 to do what's promised, various
> constructions that today are very messy/finicky/slow can be replaced with something much simpler
No, I most certainly don't "need" it.
I'm perfectly happy with locking. That's the reality of today: if you need consistency, you use locking, or you use some very clever things that work without it.
Locking works. Locking is simple. And particularly in the kernel, we have optimized locking so much (often by avoiding it entirely and using lockfree models) that it's seldom even a huge problem.
So for the kernel, I'm personally pretty happy with our model of "spend the effort to spread out locks, and use lockfree algorithms where we can".
A decade ago I tried to use the Intel transactional memory in one particular code sequence that used to be very problematic - I literally got myself one of the rare CPUs that supported it just to be able to work on it.
And it was noticeably slower than locking for normal loads, because just the cost of starting a transaction was actually quite a lot bigger than the cost of a lock that wasn't contended.
And the contention case is quite hard to hit in most cases once you have spent the work on optimizing locking. Plus the transactional memory didn't end up helping even then, because we had optimized locking and the lock was spread out and next to the data it locked, so any contention just ended up aborting the transaction too.
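For anybody who hasn't seen what that pattern looks like, the textbook RTM lock-elision shape is "start a transaction, read the lock word so that a real lock holder aborts you, fall back to actually taking the lock on abort". A minimal generic sketch with the Intel intrinsics (compile with -mrtm) - this is not the kernel sequence described above, and all the names are invented purely for the illustration:

#include <immintrin.h>  /* _xbegin, _xend, _xabort, _XBEGIN_STARTED */
#include <pthread.h>

/* Purely illustrative state, not from any kernel code. */
static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile int lock_is_held;   /* lock word the transaction must read */
static long counter;

void elided_increment(void)
{
    unsigned status = _xbegin();    /* starting the transaction is itself costly */
    if (status == _XBEGIN_STARTED) {
        /* Reading the lock word puts it in the transaction's read set, so a
         * real lock holder - or anybody writing that cacheline - aborts us.
         * That is exactly what happens when the lock word sits right next
         * to the data it protects. */
        if (lock_is_held)
            _xabort(0xff);
        counter++;                  /* the speculative update */
        _xend();                    /* commit */
        return;
    }
    /* Aborted (or RTM not available): take the lock for real. */
    pthread_mutex_lock(&fallback_lock);
    lock_is_held = 1;
    counter++;
    lock_is_held = 0;
    pthread_mutex_unlock(&fallback_lock);
}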
Playing around with the failed transactional memory code made me implement the "double-wide special atomics" thing that I mentioned in my post, so it was useful for that reason - but it was useful in a "this transactional memory thing is absolutely horrible, but I can do this special sequence without it" sense.
Because sometimes failure in a new model is good - it makes you look at what you can just do with the old model.
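As for what a "double-wide" atomic looks like in practice: on x86-64 a 16-byte compare-and-swap (cmpxchg16b) can be expressed with the compiler's generic __atomic builtins. This is a user-space illustration only - the struct and function names are made up here, and it is not the kernel's implementation of that operation:

#include <stdint.h>
#include <stdbool.h>

/* A pointer plus a generation count, updated as one 16-byte unit so a
 * reused (ABA'd) pointer can be detected. Illustrative names only. */
struct tagged_ptr {
    void     *ptr;
    uint64_t  gen;
} __attribute__((aligned(16)));

static bool tagged_ptr_cas(struct tagged_ptr *loc,
                           struct tagged_ptr *expected,
                           struct tagged_ptr  desired)
{
    /* For a 16-byte type the compiler (or libatomic) can use cmpxchg16b;
     * build with -mcx16 on x86-64. */
    return __atomic_compare_exchange(loc, expected, &desired,
                                     false /* strong */,
                                     __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}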
So I actively despise transactional memory. I think it's a fundamental mistake, and apparently IBM, which is the only company to have ever gotten it to work, agrees with that.
> (b) are you saying that if something that could speculate across say 4 atomics that would give
> you what you wanted? In other words, it's not a question of replacing messy/finicky code, it's
> just a question of performance, but if speculation were to extend well from one atomic to two
> nearby atomics to four nearby atomics, at each stage there'd be a nice performance boost?
I am saying that if you are interested in transactional memory, I think there's a better model that gets you part of the way there without the horrendous downsides.
Because transactional memory really does have horrendous downsides. The complexity is crazy. The cost of a transaction - even in the good case - is just much too big.
If you have to have bits in the cacheline to say "this line is speculative", you have already lost.
If you have to have a checkpoint-and-restore model - even if it fits well with your OoO CPU design model - you have already lost.
If contention and transaction aborts ever drive you to a "now I need to take the lock after all" model, you have already lost.
So I'm saying that "abort on conflict" is a fundamentally untenable model, because it means that once you have contention, things just get horribly worse. It's why I think a "forward progress guarantee" model is a requirement.
All the "look, transactional memory is lovely" arguments were based on "contention only happens once in a blue moon". But if it ever happens, and you make it worse, you are not the solution.
So all I'm saying is that with roughly four atomics, SW can do useful local optimizations, and it will probably never be an actual loss.
And don't take that '4' as some final value: I think it's the minimum value, and I think the maximum value to avoid excessive HW complexity is probably not much higher, but it is basically just a random number in the end.
It's a much more incremental approach than some "recompile with a hardware transaction library" approach: if you have a code-base that uses a "one big lock" approach, you'll almost certainly have a few "easy special cases" where you take the lock just to do something really small and trivial, and you can turn those few special cases into something better.
With the "few atomics" model you can then turn it into something that never takes the lock (but will still have to wait for it when others have taken said lock so that it's consistent). So if you do get contention, at least you don't make it worse.
And at least in my own experience from optimizing kernel locking, I can say that in a lot of cases, locking really is for a ridiculously small thing. You'll still need proper locking for the big things, but if you can then do some simple but common things without having to take the lock, that can help drive down contention.
Trivial things like "if this queue is empty, do nothing" are surprisingly common - but to be consistent with the threads that are in the process of adding a new entry, you generally do need locking. So then you take the lock only to check whether you need to do something else, and drop it again. Not great if there's contention - and not even great if there isn't.
But again: in the kernel we've already done most of this. We use lockfree algorithms that can do things like the above without ever taking a lock at all, just ordered reads.
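In user-space C11 atomics, the "check without taking the lock" version of that pattern looks roughly like this (a generic sketch, not any particular kernel data structure):

#include <stdatomic.h>
#include <pthread.h>
#include <stddef.h>

struct node { struct node *next; /* ... payload ... */ };

struct work_queue {
    _Atomic(struct node *) head;  /* producers publish with a release store */
    pthread_mutex_t lock;         /* still protects the actual dequeueing */
};

void maybe_process(struct work_queue *q)
{
    /* Unlocked check: the acquire load pairs with the producer's release
     * store, so a non-NULL head means the node's contents are visible.
     * If an entry shows up right after we looked, we just catch it next
     * time - the same semantics the lock-then-check version had. */
    if (atomic_load_explicit(&q->head, memory_order_acquire) == NULL)
        return;                   /* empty: nothing to do, no lock taken */

    pthread_mutex_lock(&q->lock); /* non-empty: lock only for the real work */
    /* ... dequeue and process ... */
    pthread_mutex_unlock(&q->lock);
}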
Yes, I could still find several places to use that "few atomics" thing, and it would probably help incrementally, but it wouldn't really be a big deal. We've fixed most of the big issues, we don't have locks that cover big regions, and we just don't need much in this area.
But the big wins would have to be elsewhere. In places that tried to use transactional memory but failed because it was too intrusive, too fragile, and just too broken.
And maybe the answer is that "even that more limited model isn't worth it".
Honestly, I suspect that the biggest user of any "few atomics" model would be the locking libraries. Doing fair locking is actually not trivial - in the kernel we use small locks to protect the bigger lock data structures, and you often do want to do atomic sequences of "add myself to the queue of waiters" just to get the lock in the first place.
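To make that "add myself to the queue of waiters" step concrete: in an MCS-style queue lock it is one atomic exchange on the tail plus a store to link in behind the previous waiter. A generic C11 sketch (not the kernel's qspinlock code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
};

struct mcs_lock {
    _Atomic(struct mcs_node *) tail;   /* last waiter in the queue */
};

void mcs_lock_acquire(struct mcs_lock *l, struct mcs_node *me)
{
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);

    /* "Add myself to the queue of waiters": a single atomic exchange. */
    struct mcs_node *prev =
        atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
    if (prev == NULL)
        return;                        /* queue was empty: lock is ours */

    /* Link in behind the previous waiter, then spin on our own flag. */
    atomic_store_explicit(&prev->next, me, memory_order_release);
    while (atomic_load_explicit(&me->locked, memory_order_acquire))
        ;                              /* each waiter spins on its own line */
}

void mcs_lock_release(struct mcs_lock *l, struct mcs_node *me)
{
    struct mcs_node *next =
        atomic_load_explicit(&me->next, memory_order_acquire);
    if (next == NULL) {
        /* No visible successor: try to swing the tail back to "empty". */
        struct mcs_node *expected = me;
        if (atomic_compare_exchange_strong_explicit(
                &l->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A waiter is between its exchange and its link: wait for the link. */
        while ((next = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL)
            ;
    }
    atomic_store_explicit(&next->locked, false, memory_order_release);
}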
Linus