By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), April 13, 2021 9:53 am
Room: Moderated Discussions
Etienne Lorrain (etienne_lorrain.delete@this.yahoo.fr) on April 13, 2021 1:01 am wrote:
>
> I wonder if it would be useful to have an area of fast memory (SRAM) inside the chip, accessed by all cores
> without cache, to implement locking for restricted parts (i.e. kernel only, due to the size limit).
No, for two main reasons.
First off, people really need to realize that while the contended case shows up in benchmarks a lot (particularly with bad locking), the normal case is not contended.
Really. All the HTM people seem to have entirely missed the boat on this. They are looking at the upside in often very synthetic benchmarks, and they are looking for the upside in the worst-case scenario.
Sure, if your aim is to improve some benchmark, that worst-case scenario may well be exactly what you need to look at. But realize that all you are doing is improving benchmarks, not real loads for real customers.
Benchmark loads are often simplistic, and hit the same thing over and over again. And even good benchmarks on big machines are very very seldom going to be indicative of what those big machines do in real life, because absolutely nobody competent will ever buy a machine that is expected to be that close to the edge.
In fact, a lot of the people who pay top dollar for machines will do so exactly because they want to never be even remotely close to that situation, because they often care deeply about latency of requests, and being at the edge of that kind of contention is simply not acceptable. The extreme case of this is obviously things like high-frequency trading, where people will spend insane amounts on hardware and they are almost entirely focused on latency.
That extreme case is obviously kind of special, but even in the regular case latency matters in real life, and benchmarks are often throughput-oriented (possibly with some "maximum latency" guidance which ends up being for loads that are not even remotely close to what is acceptable in real life).
So if you design your transactional memory for that situation, you'll generally find that when customers actually run their loads in real life, they'll not be at that contention level, and if your transactional memory now performs worse than the mostly uncontended lock did, then your customers will see that worse performance, and they'll go "Oh, let's turn off HTM, it actually hurts us".
And guess what happened in real life?
The exact same thing is true of special locking hardware. For it to be a success, it needs to handle the non-contended case well.
And the thing is, caches tend to capture the non-contended case really well. In ways that some special hardware lock that isn't close to a core would not.
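To make that concrete, here's a minimal C11 sketch (not any kernel's actual lock implementation) of why the cached, uncontended case is so cheap: the lock word sits right next to the data it protects, and the uncontended acquire is a single atomic op on a line that is usually already sitting in the acquiring core's cache.

/* Minimal sketch, plain C11 atomics, not a real kernel spinlock.  The lock
 * word is embedded in the same structure (and typically the same cache line)
 * as the data it protects, so the uncontended acquire is one atomic RMW on an
 * already-cached line - no special hardware needed. */
#include <stdatomic.h>

struct counter {
    atomic_flag lock;        /* init with ATOMIC_FLAG_INIT; lives next to... */
    unsigned long value;     /* ...the data it protects                      */
};

static void counter_inc(struct counter *c)
{
    /* Fast path: uncontended acquire is a single atomic op, usually a cache hit. */
    while (atomic_flag_test_and_set_explicit(&c->lock, memory_order_acquire))
        ;                    /* contended slow path: just spin, keep the sketch simple */

    c->value++;              /* critical section */

    atomic_flag_clear_explicit(&c->lock, memory_order_release);
}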
And doing locks in special on-chip locking things means that now you can't embed your locks in your data structures, and you only have a limited number of locks, so now you're hitting other problem cases too. Yes, you can use tricks like hashing to "map" the data structure to a lock in that on-chip area, but at that point you get all kinds of interesting situations like ABBA deadlocks when two different data structures end up using the same hashed hw lock, and it's just a big source of new complexity.
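A rough sketch of what that hashing scheme ends up looking like; hw_lock()/hw_unlock() here are just a software stand-in for the imagined on-chip primitive, and the whole thing is hypothetical. The point is that two data structures that have nothing to do with each other can collide on the same limited pool of locks, and then take them in opposite orders.

#include <stdatomic.h>
#include <stdint.h>

#define NR_HW_LOCKS 64

/* Stand-in for the imagined fixed pool of on-chip locks, so the sketch
 * actually compiles and runs: 0 = free, 1 = held. */
static atomic_int hw_locks[NR_HW_LOCKS];

static unsigned int hw_lock_index(const void *obj)
{
    /* Map an arbitrary object address onto one of the limited locks. */
    return (unsigned int)(((uintptr_t)obj >> 6) % NR_HW_LOCKS);
}

static void hw_lock(unsigned int idx)
{
    while (atomic_exchange_explicit(&hw_locks[idx], 1, memory_order_acquire))
        ;
}

static void hw_unlock(unsigned int idx)
{
    atomic_store_explicit(&hw_locks[idx], 0, memory_order_release);
}

/* Two *unrelated* objects can hash to the same pair of indices.  If this path
 * takes hash(a) then hash(b) while another path, working on completely
 * different objects, happens to take the same two indices in the opposite
 * order, that's a classic ABBA deadlock between code that never intended to
 * share a lock at all.  (And if hash(a) == hash(b) you deadlock against
 * yourself.  The usual mitigation - always locking in index order - is
 * exactly the kind of new complexity being complained about.) */
void update_pair(void *a, void *b)
{
    unsigned int ia = hw_lock_index(a);
    unsigned int ib = hw_lock_index(b);

    hw_lock(ia);
    hw_lock(ib);
    /* ... touch *a and *b ... */
    hw_unlock(ib);
    hw_unlock(ia);
}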
The other reason hw locking isn't great is simply that the real problems with contention show up not when you have one single die, but when you have multiple sockets (or "multiple dies on the same package"). If you have only one piece of silicon, you are going to have an even harder time showing the problem cases.
And again, this shows just how out of touch the HTM people are - including in this very thread. I've literally seen arguments like "HTM doesn't make sense for single socket, because you won't have contention". Yeah, if that's the thinking behind your HTM design, your HTM design is garbage by definition. The transactional memory needs to help latency and the uncontended case too. If your transactional memory can't keep up in the uncontended case, your transactional memory is shit.
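For reference, the classic lock-elision pattern looks roughly like this sketch using the x86 RTM intrinsics (compile with -mrtm; illustrative only, not any particular library's implementation, and a real one would retry before falling back). The transactional fast path has to at least match a plain cached lock word in the uncontended case, or the whole exercise is a loss.

#include <immintrin.h>   /* _xbegin/_xend/_xtest/_xabort, x86 RTM, -mrtm */
#include <stdatomic.h>

struct elided_lock {
    atomic_int held;     /* 0 = free, 1 = taken via the fallback path */
};

static void el_lock(struct elided_lock *l)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        /* Transactional path: read the lock word so that a real owner
         * (or a later fallback acquisition) aborts this transaction. */
        if (atomic_load_explicit(&l->held, memory_order_relaxed))
            _xabort(0xff);
        return;          /* run the critical section transactionally */
    }
    /* Fallback path: an abort or a held lock pushed us here, take the real
     * lock.  If this happens on the *uncontended* common case, elision is a
     * pure loss compared to a plain cached lock word. */
    while (atomic_exchange_explicit(&l->held, 1, memory_order_acquire))
        ;
}

static void el_unlock(struct elided_lock *l)
{
    if (_xtest())
        _xend();         /* we were running transactionally: commit */
    else
        atomic_store_explicit(&l->held, 0, memory_order_release);
}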
And finally - special hardware locking very much exists. But it exists not for general purpose CPU locking, but for the case where you have different cores (possibly IO accelerators, things like that) that have synchronization issues. Even that case might be handled by the different cores just having some protocol wrt locks in memory, but historically those IO accelerators might not participate in cache coherence etc, so you may well end up with special locking sequences and hardware support for synchronization.
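Linux wraps that class of hardware behind its hwspinlock framework; as a purely hypothetical illustration of the underlying idea, such a hardware semaphore is often just an MMIO register with read-to-lock / write-to-unlock semantics that works even when the other side isn't cache coherent. None of the addresses or details below describe a real device.

#include <stdint.h>
#include <stdbool.h>

#define HWSEM_BASE   0xFEED0000u    /* hypothetical MMIO base of a semaphore block */
#define HWSEM_REG(n) ((volatile uint32_t *)(uintptr_t)(HWSEM_BASE + 4u * (n)))

static bool hwsem_trylock(unsigned int n)
{
    /* Read-to-lock: the hardware atomically marks the semaphore taken and
     * returns 0 to the winner, nonzero to everyone else. */
    return *HWSEM_REG(n) == 0;
}

static void hwsem_unlock(unsigned int n)
{
    *HWSEM_REG(n) = 0;              /* write-to-release */
}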
Linus