By: anon2 (anon.delete@this.anon.com), March 31, 2021 3:57 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on March 31, 2021 11:50 am wrote:
> Andrey (andrey.semashev.delete@this.gmail.com) on March 31, 2021 5:27 am wrote:
> >
> > You obviously have to write non-transactional path, and it will have its pitfalls, but the point
> > is that you could have better best-case and average performance with TSX.
>
> No, you really really don't.
>
> TSX was slow even when it worked and didn't have aborts, and never gave you "best-case"
> performance at all due to that. Simple non-contended non-TSX locks worked better.
>
> And TSX was a complete disaster when you had any data contention, and just caused overhead and aborts
> and fallbacks to locked code, so - no surprise - plain non-TSX locks worked better. And data contention
> is quite common, and happened for a lot of trivial reasons (statistics being one).
>
> And no, TSX didn't have better average performance either, because in order to avoid the
> problems, you had to do statistics in software, which added its own set of overhead.
>
> As far as I know, there were approximately zero real-world loads that were better with TSX than without.
>
> The only case that TSX ever did ok on was when there was zero data contention at all, and lots of cache
> coherence costs almost entirely due to locking, and then TSX can keep the lock as a shared cache
> line. Yes, this really can happen, but most of the time it happens is when you also have big enough
> locked regions that they don't get caught by the transactional memory due to size overflows.
>
> And making the transaction size larger makes the costs higher too, so now you need to do a
> much better job at predicting ahead of time whether transactions will succeed or not. Which
> Intel entirely screwed up, and I blame them completely. I told them at the first meeting
> they had (before TSX was public) that they need to add a TSX predictor, and they never did.
>
> And the problems with TSX were legion, including data leaks and actual outright memory ordering bugs.
>
> TSX was garbage, and remains so.
>
> This is not to say that you couldn't get transactional memory right, but as it stands right now, I do not believe
> that anybody has ever had an actual successful and useful implementation of transactional memory.
>
> And I can pretty much guarantee that to do it right you need to have a transaction success predictor
> (like a branch predictor) so that software doesn't have to deal with yet another issue of "on
> this uarch, and this load, the transaction size is too small to fit this lock".
>
> I'm surprised that ARM made it part of v9 (and surprised that ARM kept the 32-bit
> compatibility part - I really thought they wanted to get rid of it).
It is interesting. Of the ARM implementers, the only one I'd have any hope at all of getting transactional memory right is Apple. But they don't seem to be making moves toward big server CPUs with lots of cores, and their basic locking and coherency should already scale perfectly well to a handful of cores on one chip.
Maybe they're planning to get into servers? Or maybe some of the vendors pushing a lot of cores, like Ampere, are pushing for it in the hope that it will help with their scalability issues. I don't think much of ARM Ltd's chances of making something that works well in that case, but stranger things have happened.
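
For reference, the "transactional path plus non-transactional fallback" being talked about above looks roughly like the sketch below. This is only an illustration against the x86 RTM intrinsics (_xbegin/_xend/_xabort from immintrin.h, built with -mrtm); the spinlock, helper names, and retry count are made up for the example, not taken from any real lock implementation.

/* Sketch of TSX lock elision with a locked fallback path.
 * Illustrative only: lock, names, and retry count are invented. */
#include <immintrin.h>      /* _xbegin, _xend, _xabort, _XBEGIN_STARTED */
#include <stdatomic.h>

static atomic_int fallback_lock = 0;      /* 0 = free, 1 = held */

static void lock_fallback(void)
{
    while (atomic_exchange_explicit(&fallback_lock, 1, memory_order_acquire))
        while (atomic_load_explicit(&fallback_lock, memory_order_relaxed))
            ;                             /* spin until it looks free */
}

static void unlock_fallback(void)
{
    atomic_store_explicit(&fallback_lock, 0, memory_order_release);
}

/* Run fn(arg) under elision if possible, under the real lock otherwise. */
void run_elided(void (*fn)(void *), void *arg)
{
    for (int attempt = 0; attempt < 3; attempt++) {
        if (_xbegin() == _XBEGIN_STARTED) {
            /* Read the lock word inside the transaction: if another thread
             * takes the fallback lock, this transaction aborts automatically. */
            if (atomic_load_explicit(&fallback_lock, memory_order_relaxed))
                _xabort(0xff);
            fn(arg);
            _xend();
            return;                       /* transactional path committed */
        }
        /* Aborted (contention, capacity, interrupt, ...): retry, then give up. */
    }
    /* Non-transactional path: the fallback you still have to write and get right. */
    lock_fallback();
    fn(arg);
    unlock_fallback();
}

The fallback path still has to exist and be correct on its own, so elision only buys you anything when the transaction actually commits most of the time, which is exactly the point being argued about.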