By: dmcq (dmcq.delete@this.fano.co.uk), April 1, 2021 10:35 am
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on April 1, 2021 10:00 am wrote:
> @never_released (never_released.delete@this.gmx.tw) on April 1, 2021 8:21 am wrote:
> >
> > That however doesn't change the fact that you sometimes don't want to start the transaction
> > in the first place, which will have to be handled in software.
>
> This is the big thing.
>
> If you use TM to do general-purpose lock elision, you absolutely will get cases where
> the transaction will basically always fail because of transaction capacity issues (or possibly
> due to the locked region doing something that cannot be transactional, like IO).
>
> Now, part of that can be handled by the compiler being aware of all the transaction limits,
> but that kind of static knowledge is simply not sufficient for very fundamental reasons.
> One of those reasons being dynamic behavior (think conditional branches), another being
> simply that the limits will depend on the microarchitectural details.
>
> End result: static "prediction" by the compiler on whether a transaction will succeed or not is pure
> and utter garbage. It's not useful for a serious transactional memory model, unless you are ok with
> limiting the transaction size to some architectural minimum guarantee (which means that it's not going
> to be all that commonly useful, and certainly not the promised "elide locks in the common case").
>
> So if the static prediction by the compiler isn't good enough, you have three choices:
>
> (a) don't predict at all (outside of the truly obvious case where the compiler can see "this has no
> chance at all"), and take the failure case every time (or at least quite often), go to the slow case
>
> (b) add dynamic prediction in software
>
> (c) do the dynamic prediction in hardware.
>
> And my claim is that (a) is not useful - it's going to be slower than not having TM in
> the first place, and that (b) is much too expensive unless your load is very controlled
> and you can limit it some way. Which leaves you (c), and nobody has ever done it correctly,
> and as mentioned there is absolutely zero sign that ARM is doing it either.
>
> End result: the ARM implementation is the same broken stuff we have
> seen before that has already been shown to not be good enough.
>
> And yes, I'm disappointed.
>
> Side note: I've seen this before at Transmeta. It was a different kind of transactional
> memory (not for threading, but for speculation), but it had all the same issues. There
> we used another approach: predict in the JIT, and if it's wrong, just re-JIT the code.
>
> That JIT approach does work, but it only works if the high-performance code you care about
> is JIT'ed, and that's not generally the case for something like an ARM server chip.
>
> Linus
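
To make the pattern Linus is describing concrete, here is a minimal sketch of lock elision with a fallback path. It uses Intel's RTM intrinsics from immintrin.h, since those are the widely documented analogue of whatever ARM's TME intrinsics will look like; the lock layout and the retry budget are invented for illustration.

    #include <immintrin.h>   /* _xbegin, _xend, _xabort, _XBEGIN_STARTED */
    #include <stdatomic.h>

    static atomic_int lock_word = 0;   /* 0 = free, 1 = held */

    static void lock_fallback(void)
    {
        /* The non-transactional slow path: an ordinary spinlock. */
        while (atomic_exchange_explicit(&lock_word, 1, memory_order_acquire))
            while (atomic_load_explicit(&lock_word, memory_order_relaxed))
                ;   /* wait until it looks free, then retry the exchange */
    }

    void elided_lock(void)
    {
        for (int i = 0; i < 3; i++) {   /* retry budget: pure guesswork */
            unsigned status = _xbegin();
            if (status == _XBEGIN_STARTED) {
                /* Reading the lock word puts it in our read set, so a
                 * thread that takes the real lock aborts this transaction. */
                if (atomic_load_explicit(&lock_word, memory_order_relaxed) == 0)
                    return;             /* running elided */
                _xabort(0xff);          /* lock is held: bail out */
            }
            /* We aborted: conflict, capacity, IO in the region, etc.
             * If the hardware says retrying is hopeless, stop early. */
            if (!(status & _XABORT_RETRY))
                break;
        }
        lock_fallback();                /* the slow case in option (a) */
    }

    void elided_unlock(void)
    {
        if (atomic_load_explicit(&lock_word, memory_order_relaxed) == 0)
            _xend();                    /* still transactional: commit */
        else
            atomic_store_explicit(&lock_word, 0, memory_order_release);
    }

Note where the capacity problem bites: nothing in that loop can know in advance whether the critical section fits, so a region that always overflows pays for the failed transactions every single time before it finally reaches the real lock.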
I am surprised at ARM then. It doesn't sound like they have much chance of getting a worthwhile facility, given the experience others have had. And it isn't as though this had to be in ARMv9 from the start, unlike the atomics and SVE2, which really do need to be part of a server baseline.
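
Regarding option (b): the closest real-world attempt I know of is the adaptive elision glibc used for pthread mutexes, which keeps a per-lock "skip" counter and stops trying transactions on locks where they keep aborting. A rough sketch of the idea, with all names and tuning constants invented for illustration (again using RTM intrinsics as a stand-in):

    #include <immintrin.h>
    #include <stdatomic.h>

    typedef struct {
        atomic_int lock_word;   /* 0 = free, 1 = held */
        int        skip_count;  /* predictor state: > 0 means "don't bother" */
    } elided_mutex;

    #define SKIP_AFTER_ABORT 16 /* arbitrary penalty; tuning this well
                                   per workload is the expensive part */

    static void take_lock(elided_mutex *m)
    {
        while (atomic_exchange_explicit(&m->lock_word, 1,
                                        memory_order_acquire))
            ;   /* spin on the real lock */
    }

    void adaptive_lock(elided_mutex *m)
    {
        /* skip_count is read and written racily on purpose: it is
         * only a hint, and making it atomic would add yet more cost. */
        if (m->skip_count > 0) {
            m->skip_count--;            /* let the penalty decay */
            take_lock(m);               /* predicted to abort: skip TM */
            return;
        }
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (atomic_load_explicit(&m->lock_word,
                                     memory_order_relaxed) == 0)
                return;                 /* elided */
            _xabort(0xff);              /* real lock held */
        }
        /* Aborted: punish this lock so the next few acquisitions go
         * straight to the slow path instead of failing again. */
        m->skip_count = SKIP_AFTER_ABORT;
        take_lock(m);
    }

Even this cheap predictor adds state and branches to every lock acquisition, which is Linus's point about (b) being too expensive unless the load is controlled enough for the counters to pay off.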