By: anon3 (anon3.delete@this.rwt.com), March 31, 2021 4:09 pm
Room: Moderated Discussions
anon2 (anon.delete@this.anon.com) on March 31, 2021 3:57 pm wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on March 31, 2021 11:50 am wrote:
> > Andrey (andrey.semashev.delete@this.gmail.com) on March 31, 2021 5:27 am wrote:
> > >
> > > You obviously have to write non-transactional path, and it will have its pitfalls, but the point
> > > is that you could have better best-case and average performance with TSX.
> >
> > No, you really really don't.
> >
> > TSX was slow even when it worked and didn't have aborts, and never gave you "best-case"
> > performance at all due to that. Simple non-contended non-TSX locks worked better.
> >
> > And TSX was a complete disaster when you had any data contention, and just caused overhead and aborts
> > and fallbacks to locked code, so - no surprise - plain non-TSX locks worked better. And data contention
> > is quite common, and happened for a lot of trivial reasons (statistics being one).
> >
> > And no, TSX didn't have better average performance either, because in order to avoid the
> > problems, you had to do statistics in software, which added its own set of overhead.
> >
> > As far as I know, there were approximately zero real-world loads that were better with TSX than without.
> >
> > The only case that TSX ever did ok on was when there was zero data contention at all, and lots of cache
> > coherence costs almost entirely due to locking, and then TSX can keep the lock as a shared cache
> > line. Yes, this really can happen, but most of the time it happens is when you also have big enough
> > locked regions that they don't get caught by the transactional memory due to size overflows.
> >
> > And making the transaction size larger makes the costs higher too, so now you need to do a
> > much better job at predicting ahead of time whether transactions will succeed or not. Which
> > Intel entirely screwed up, and I blame them completely. I told them at the first meeting
> > they had (before TSX was public) that they need to add a TSX predictor, and they never did.
> >
> > And the problems with TSX were legion, including data leaks and actual outright memory ordering bugs.
> >
> > TSX was garbage, and remains so.
> >
> > This is not to say that you couldn't get transactional memory
> > right, but as it stands right now, I do not believe
> > that anybody has ever had an actual successful and useful implementation of transactional memory.
> >
> > And I can pretty much guarantee that to do it right you need to have a transaction success predictor
> > (like a branch predictor) so that software doesn't have to deal with yet another issue of "on
> > this uarch, and this load, the transaction size is too small to fit this lock".
> >
> > I'm surprised that ARM made it part of v9 (and surprised that ARM kept the 32-bit
> > compatibility part - I really thought they wanted to get rid of it).
>
> It is interesting. Of the ARM implementations, the only one I would have any hope at
> all of implementing transactional memory well is Apple. But they don't seem to be making
> moves to big server CPUs with a lot of cores and their basic locking and coherency should
> be plenty fast enough to scale to a handful of cores on one chip very well.
>
> Maybe they're planning to get into servers? Or maybe some of the vendors pushing a lot of cores like
> Ampere are pushing for it hoping it will help their scalability issues. I don't think much of ARM Ltd's
> chances of making something that works well in that case, but stranger things have happened.
It's most likely that they added it because some vendors, or large hyperscale-level customers of those vendors, have been asking for it. There don't seem to be enough existence proofs of its effectiveness in real-world usage to have added it purely on quantitative grounds.