By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), April 1, 2021 10:54 am
Room: Moderated Discussions
Andrey (andrey.semashev.delete@this.gmail.com) on March 31, 2021 11:31 pm wrote:
>
> I don't see why TSX and similar solutions wouldn't work in these constrained cases. AFAIR, Intel's
> TSX implementation actually surpasses the 4 cache lines and has no limit on instruction count.
[ I thought I already answered this, but I don't see it, so I probably previewed it and then forgot to post it ]
TSX could do it.
But when I tested it, using TSX for small locked regions was slower than just taking the lock. Both non-contended and when under contention.
The thing is, most locks do not have a lot of contention in real life. Yes, they can affect scalability in a big way, but that's usually true mostly in the "heavily loaded" case, and that is usually the case that people try to avoid.
So one very common case is "mostly single-threaded, little contention, latency matters".
And TSX was slower than just taking a lock in that case, so it hurt latency.
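Roughly the kind of comparison being made, as a minimal sketch (illustrative names only, not the actual test code): a tiny critical section, once behind a plain test-and-set lock and once inside an RTM transaction that falls back to that lock on abort. Needs a CPU with TSX enabled; build with something like gcc -O2 -mrtm.

/* Minimal latency-comparison sketch; names are made up. */
#include <immintrin.h>   /* _xbegin(), _xend(), _XBEGIN_STARTED */
#include <x86intrin.h>   /* __rdtsc() */
#include <stdatomic.h>
#include <stdio.h>

static atomic_int lock;        /* 0 = free, 1 = held */
static volatile long counter;  /* the "protected" data */

static void plain_lock_inc(void)
{
    while (atomic_exchange_explicit(&lock, 1, memory_order_acquire))
        ;                      /* take the lock (never contended here) */
    counter++;                 /* tiny critical section */
    atomic_store_explicit(&lock, 0, memory_order_release);
}

static void rtm_inc(void)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        counter++;             /* same tiny critical section, transactionally */
        _xend();               /* commit */
    } else {
        plain_lock_inc();      /* abort path: fall back to the real lock */
    }
}

int main(void)
{
    enum { N = 1000000 };
    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < N; i++) plain_lock_inc();
    unsigned long long t1 = __rdtsc();
    for (int i = 0; i < N; i++) rtm_inc();
    unsigned long long t2 = __rdtsc();
    printf("plain lock: %.1f cycles/op, RTM: %.1f cycles/op\n",
           (double)(t1 - t0) / N, (double)(t2 - t1) / N);
    return 0;
}

Even completely uncontended, the xbegin/xend round trip isn't free, and for a critical section this small that fixed cost dominates - which is the latency hit being described.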
At the other end of the spectrum, you have the "lots of contention" case.
And TSX was slower in that case too, because then you generally have data contention too, and TSX aborts.
See the problem?
So TSX doesn't work well for the constrained case.
The only case that TSX would win is the non-constrained case when there's little or no data contention, and the contention is entirely on the lock itself - and TSX eliding the lock thus gets rid of the lock bouncing back and forth (because it can now be shared).
That case does exist in theory, and you can make nice benchmarks for it (and people did make benchmarks for it). But it's not actually that common. The main case it happens in is when you have the "one big lock for everything" thing, and "everything" ends up being so spread out that you don't see data contention there.
But it turns out that even when you do have that "one big lock", you end up with data contention anyway. The heavy loads aren't really all that spread out after all - the hot stuff tends to be the same things, and people hammer on the same data over and over. And because you have one big lock, you also have big locked regions, so you get transaction capacity issues, data conflicts on other globals, and a fair amount of transaction aborts.
And yes, if you have code with the "one big lock" approach, that code will have all those other globals that conflict too. Statistics is one of the main ones, but not the only one by far. Codebases that haven't had a lot of locking work simply haven't had all the work to make things use per-thread variables etc.
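As a tiny illustration of the per-thread point (a sketch with made-up names, not from any real codebase): a global statistics counter bumped inside every critical section guarantees a write-set conflict between any two threads trying to elide the same lock, while a per-thread counter takes that cache line out of the transaction entirely.

/* Sketch only; names are made up. */
static long global_stats;            /* every elided section writes this:
                                        any two concurrent transactions
                                        conflict on it and abort          */
static _Thread_local long my_stats;  /* per-thread copy: private cache line,
                                        nothing shared in the write set    */

static void account_op(void)
{
    /* global_stats++;   <- the "one big lock" style: aborts under elision */
    my_stats++;          /* per-thread: conflict-free; sum the per-thread
                            counters only when you actually need a total   */
}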
So the case that TSX should be optimal for ("one big lock") tends to have all those other artifacts that then cause data conflicts inside that lock. So lots of aborts.
And those aborts are expensive.
They are expensive for two reasons:
(a) they are just inherently expensive: you did all that work, aborted, restarted, and had to do it with the backup plan and proper locking. Easily hundreds (or thousands) of cycles for the abort - wasted, because you have to fall back on the non-TSX code sequence.
(b) once you do proper locking, now everything has to do locking, because the TSX sequence fundamentally depended on that "check that the lock is not held by anybody else" for the lock sharing, and once that check fails (because one user aborted and took the lock for real), now every user has to abort (or spin).
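For reference, the elision pattern that detail (b) comes from looks roughly like this - a sketch assuming a plain test-and-set spinlock, with purely illustrative names, not kernel code. The transaction only reads the lock word, so the lock line can stay shared across all the eliding cores; but that same read is what turns one thread's fallback into everyone else's abort.

/* Classic lock-elision sketch; illustrative names, not real kernel code. */
#include <immintrin.h>    /* _xbegin(), _xend(), _xabort(), _XBEGIN_STARTED */
#include <stdatomic.h>

static atomic_int lock;   /* 0 = free, 1 = held */

static void lock_acquire(void)   /* the real, non-elided lock */
{
    while (atomic_exchange_explicit(&lock, 1, memory_order_acquire))
        ;                        /* spin until free */
}

static void lock_release(void)
{
    atomic_store_explicit(&lock, 0, memory_order_release);
}

/* Run fn() under elision if we can, under the real lock if we can't. */
static void run_elided(void (*fn)(void))
{
    if (_xbegin() == _XBEGIN_STARTED) {
        /* Only *read* the lock word: it joins the read set and stays in
         * shared state in every eliding core's cache, so it never bounces. */
        if (atomic_load_explicit(&lock, memory_order_relaxed))
            _xabort(0xff);       /* someone holds the real lock: bail out */
        fn();                    /* critical section, tracked by the HW */
        _xend();                 /* commit if nothing conflicted */
        return;
    }
    /* Abort path (conflict, capacity, interrupt, ...): take the lock for
     * real.  That write to the lock word conflicts with the read every
     * other eliding thread did above, so they all abort too - the cascade
     * described in (b). */
    lock_acquire();
    fn();
    lock_release();
}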
See? The devil is in the details. You can make all those nice arguments for why transactional memory is so great. But they tend to fall apart when the rubber hits the road.
The basic issue is that TSX is lovely in theory. But in practice it doesn't work well.
There were bugs, and the performance simply isn't great - not for small transactions, and not for big ones either.
The problem space is just harder than people think it is.
Linus