By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), April 3, 2021 2:22 pm
Room: Moderated Discussions
sr (nobody.delete@this.nowhere.com) on April 3, 2021 12:39 pm wrote:
>
> It's not about dirtying cachelines, it's about sharing critical section data.
sr, I understand transactional memory. Really. I do.
You're ignoring the big elephant in the room that I've pointed to over and over and over again: your theoretical arguments are pure and utter garbage in reality.
It's not "99.999% unmodified and only 0.001% data changes". That's your made-up dream world.
It's "real loads tend to actually change the same data".
I realize that you may have been fooled by various made-up benchmarks where you use some random distribution and "prove" that transactional memory works.
In real life, the access distribution isn't random. In real life, people hit on that same shared data all the time, and so you get a lot of conflicts. In real life, if you have that one big lock situation, then the locked section is often also so large that it won't fit in the transaction.
Just to make things really concrete, think about some "perfect" load for transactional memory, where you have hash lists, and you elide a lock around all the lookups and the modifications (because, let's face it, fine-grained locking is hard, particularly if you occasionally move elements around or have other ordering constraints).
Because the data elements are hashed, you expect that there will be very few actual data conflicts, and generally modifications are much less common than lookups anyway, so even if you end up touching the same data, it will mostly be a read-read that won't cause a conflict and an abort.
So this is basically the "wet dream" situation for transactional memory, where you expect to be able to do almost everything with nary a conflict in sight, and hopefully the operations you do on it are also small enough that you're not going to be anywhere near the capacity limits either.
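Just so we're all picturing the same thing, here's a minimal sketch of what that elided lock looks like with Intel's RTM intrinsics. The names and the spinlock fallback are made up for illustration, not taken from any particular implementation (build with -mrtm):

    #include <immintrin.h>      /* _xbegin()/_xend()/_xabort() */

    static volatile int table_lock;     /* 0 = free, 1 = held (fallback path) */

    static void elided_lock(void)
    {
            if (_xbegin() == _XBEGIN_STARTED) {
                    if (!table_lock)    /* lock word is now in our read set */
                            return;     /* run the critical section as a transaction */
                    _xabort(0xff);      /* somebody really holds it: give up */
            }
            /* transaction aborted (conflict, capacity, lock held): take it for real */
            while (__sync_lock_test_and_set(&table_lock, 1))
                    while (table_lock)
                            ;           /* spin */
    }

    static void elided_unlock(void)
    {
            if (!table_lock)
                    _xend();            /* we were running transactionally */
            else
                    __sync_lock_release(&table_lock);
    }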
And guess what? In real life, a lot of those loads also want to have basic statistics, and do things like success/fail ratios for how often the hash queue lookups actually hit in that hash, or just keep track of things like the queue length in order to dynamically resize the hash when required - etc etc.
And suddenly, those will all be globals that the lock you tried to elide protected, and the whole "a hash queue is distributed and we only touch 0.0001% of the data" is just a dream, because it turns out that every single lookup incremented that same single "success" counter.
Yeah, technically, that single counter is much much less than 0.001% of all the data. It's an absolutely minuscule part of your gigabytes of actual data. That doesn't help now, does it?
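To make that concrete, here's roughly what that lookup path ends up looking like - made-up names again, reusing the elided_lock()/elided_unlock() sketch from above:

    struct entry {
            struct entry    *next;
            unsigned long    key;
            /* payload ... */
    };

    struct stats {
            unsigned long    hits;      /* bumped on every successful lookup */
            unsigned long    misses;    /* bumped on every failed one */
            unsigned long    nr_items;  /* used to decide when to resize */
    };
    static struct stats table_stats;    /* one global, shared by every CPU */

    static struct entry *lookup(struct entry *bucket, unsigned long key)
    {
            struct entry *e;

            elided_lock();
            for (e = bucket; e; e = e->next) {
                    if (e->key == key) {
                            table_stats.hits++;     /* every lookup writes this line */
                            elided_unlock();
                            return e;
                    }
            }
            table_stats.misses++;       /* ... or this one */
            elided_unlock();
            return NULL;
    }

The buckets themselves are nicely spread out. table_stats isn't: two concurrent elided lookups both write that one cacheline, so one of them aborts, every single time.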
And then you realize that hey, not only did you have those globals, you also have a lot of false sharing, because any insertion onto a hash queue will change just one word, but it turns out that word was right next to the other seven hash queue pointer buckets that were in that same cacheline, and that first cacheline of the queue is also exactly the cacheline that all the lookups start out with.
So now that 0.001% - which was probably an optimistic and unrealistic number to begin with - was actually off by an order of magnitude just due to false sharing issues. But hey, hash queue modifications are rare, right? Yeah, sometimes they are. Not always.
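In case the layout isn't obvious, a sketch of what I mean (illustrative numbers: 8-byte pointers, 64-byte cachelines, and a stand-in for the real hash function):

    #define NR_BUCKETS 1024

    static struct entry *table[NR_BUCKETS];    /* 8 bucket heads per 64-byte line */

    static void insert(struct entry *e)
    {
            unsigned long i = e->key % NR_BUCKETS;  /* stand-in for the real hash */

            elided_lock();
            e->next = table[i];
            table[i] = e;       /* one 8-byte store dirties the whole line, and
                                   conflicts with every transactional lookup that
                                   merely read one of the seven neighbouring heads */
            elided_unlock();
    }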
And then you start actually doing real loads, and it really turns out that the access patterns really weren't even remotely random, and that everybody really is looking at the same hash bucket over and over again, because it turns out that certain data values are just much hotter and more common than others.
And it easily turns out that the hash lists themselves were embedded in data structures that were modified independently (and not part of the lookup logic), so now you have another source of false sharing right there - not with the lookups, but with the other threads that already looked things up and are now updating access counters for that entry, or whatever.
And you never saw those as cachelines bouncing back and forth before, because the single lock serialized them, so sure, the cacheline was moving around, but the concurrency was limited and HW prefetching actually worked and mostly hid it, and so it didn't actually show up as a huge problem. The lock showed up as a big issue in the profiles, but those kinds of incidentals didn't.
But suddenly, now that you're doing concurrent lookups, the reference counts or flag updates or whatever that you did as you successfully looked up an element end up bouncing that same cacheline which contained the hash list you used for looking things up.
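Put that in struct form and it's exactly the kind of thing nobody even thinks of as "shared modified data" - the fields are illustrative, and lookup() is the elided one from the earlier sketch:

    struct entry {
            struct entry    *next;      /* walked by every lookup in this bucket */
            unsigned long    key;
            unsigned long    refcount;  /* bumped by whoever just found the entry */
            unsigned long    flags;     /* "accessed"/"dirty" bits and friends */
            /* ... and all of the above sit on the same 64-byte cacheline */
    };

    static struct entry *lookup_and_get(struct entry *bucket, unsigned long key)
    {
            struct entry *e = lookup(bucket, key);  /* the elided lookup from above */

            if (e)
                    e->refcount++;      /* dirties the very line that holds e->next,
                                           so every other CPU still walking this
                                           bucket eats a miss - or, inside a
                                           transaction, a conflict abort - on it */
            return e;
    }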
Because - after all - the whole point of lock elision was to take that code that wasn't designed for fine-grained locking, and just make it magically scale well. But the key here is really that "it wasn't designed for fine-grained locking". The data structures weren't designed to not have false sharing, or designed to not have those conflicts.
And yes, despite all these potential issues, under certain loads you never see any of them, and it all runs beautifully and like a bat out of hell.
And under other loads, you get these really hard to figure out aborts due to some odd issue that is really hard to debug, because you've gotten rid of all the statistics fields that caused you problems, and maybe you as a developer don't even have access to the load and the data that your customer is having performance problems with.
So stop with the theoretical "99.999% vs 0.001%".
If that were reality, then TSX would have taken the world by storm.
Hint: it didn't.
Linus