By: Andrey (andrey.semashev.delete@this.gmail.com), April 7, 2021 2:32 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on April 7, 2021 1:20 pm wrote:
>
> I personally would expect that the most likely actual true fix to the HTM problems is to only allow
> small transactions in the first place, prove statically that they can complete (kind of like the constrained
> s390 transactions), and avoid the whole software-visible abort/fallback situation entirely.
>
> IOW, make HTM act like any other CPU speculation fault where the
> hardware will retry and know it can complete it eventually.
>
> IOW, take the "RISC approach" to HTM. Make it simple, make it stupid, but
> make that simple and stupid case go really really fast, exactly because it
> doesn't handle all the complexities, and doesn't involve a lot of state.
>
> Don't do any "register state rollback" garbage: the transactional register state is limited to the normal
> register renames, and it's basically all done by the existing OoO speculative execution engine.
>
> Don't do any "transactional L1 cache contents" garbage. Transactional stores are limited to the store queue,
> and the cache conflict tracing is done using the existing memory reordering speculation hardware.
>
> Don't do any "software retry" garbage: the architected HTM size is guaranteed to make
> forward progress because it is of limited size (exactly like "ll/sc" sequences).
LL/SC is a very restricted version of HTM. You do have to retry in software if the SC fails, although you normally don't implement a separate fallback path - you just retry, hoping that it will eventually succeed. Note that forward progress is not guaranteed with LL/SC, unlike regular atomic RMW instructions, which is what makes the latter superior.
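To make the software-retry point concrete, here is a minimal C11 sketch (mine, not something from the post) of the loop an LL/SC-based atomic read-modify-write boils down to: the weak compare-exchange is allowed to fail spuriously precisely because the underlying store-conditional can fail, so software has to loop.

    #include <stdatomic.h>

    /* A minimal sketch: on an LL/SC machine this is roughly the loop that
     * an atomic fetch-and-add compiles down to.  compare_exchange_weak is
     * allowed to fail spuriously (the store-conditional lost its
     * reservation), so software must retry; no single iteration is
     * guaranteed to make forward progress. */
    static int fetch_add_llsc_style(_Atomic int *p, int delta)
    {
        int old = atomic_load_explicit(p, memory_order_relaxed);
        while (!atomic_compare_exchange_weak_explicit(
                   p, &old, old + delta,
                   memory_order_acq_rel, memory_order_relaxed)) {
            /* 'old' now holds the freshly observed value; retry. */
        }
        return old;
    }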
I think the "forward progress" arguments that are made throughout this discussion are missing that you can't have a forward progress guarantee as long as you have (a) preemption and (b) virtual memory and paging. Basically, if your thread can be interrupted at an arbitrary point and that can cause an abort, you don't have forward progress guarantee. That doesn't mean that transactions will abort often, but it does mean that aborts will happen regardless of the programmer's effort and you have to have some sort of a retry and possibly a fallback path that is less likely to fail. And yes, as you noted, this requirement is the more emphasized the larger the transaction.
Regarding the abort semantics, i.e. whether the memory and register state is restored, I don't think this is crucial to the success of a particular HTM implementation. Surely you want at least some memory accesses to be rolled back (or simply not committed) on a transactional abort, but it doesn't have to be all accesses, as it is with TSX. I could imagine a workable model where only the accesses that are to be monitored and rolled back on abort have to be performed with special instructions (something like AMD ASF). Such a model would be less convenient to program, but still acceptable. Whether register state is restored is also not very important, as long as there is a way to restore the registers on retry.
What is important for the success of an HTM implementation is (a) the cost of the successful path and (b) the probability and cost of aborts. With (a) it's pretty clear - the cost must approach the cost of an atomic operation (not counting the transaction body itself, of course). With (b), you have to accept that aborts will happen and that you have to program a retry and/or fallback path, but the implementation must minimize the probability of aborts due to hardware events, and preferably provide ways to minimize aborts on the software side as well. That is, a page fault, an interrupt, a thread preemption - all these must not cause an abort, unless a memory conflict actually happened during the event.
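For TSX/RTM specifically, the retry-plus-fallback shape I have in mind looks roughly like this (a simplified sketch using Intel's RTM intrinsics, built with -mrtm; a real lock-elision scheme would additionally read the fallback lock inside the transaction so that a lock holder conflicts with transactional readers):

    #include <immintrin.h>   /* _xbegin, _xend, _XBEGIN_STARTED */
    #include <pthread.h>

    #define MAX_TX_RETRIES 3

    static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Try to run the critical section transactionally; after a few aborts
     * (conflict, capacity, interrupt, page fault, ...) give up and take a
     * plain lock.  The fallback path is mandatory because RTM gives no
     * forward progress guarantee for any individual transaction. */
    static void run_critical(void (*critical_section)(void))
    {
        for (int attempt = 0; attempt < MAX_TX_RETRIES; ++attempt) {
            unsigned status = _xbegin();
            if (status == _XBEGIN_STARTED) {
                critical_section();
                _xend();        /* commit */
                return;
            }
            /* 'status' encodes the abort cause and could be used to decide
             * whether another retry is worthwhile. */
        }
        pthread_mutex_lock(&fallback_lock);
        critical_section();
        pthread_mutex_unlock(&fallback_lock);
    }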
> In that situation, a transaction that is bigger than the architected hardware resources would
> not be an "abort" - it would just be a fatal error, like a divide-by-zero is, and would trap.
> It would be very expensive indeed, but that would be ok - because there would be no "retry with
> fallback" for that case. It would have been a programmer error to have generated such an instruction
> sequence in the first place, exactly like it's a programmer error to divide by zero.
Except that the programmer usually doesn't generate instructions directly. Unlike dividing by zero, the programmer cannot ensure that a particular sequence of instructions is generated unless he writes in assembler. And I doubt a technology that basically requires writing in assembler is going to succeed beyond a few niche cases.
> Personally, I'd much rather see hardware companies try to start from that kind of very targeted
> (and less ambitious) HTM. Maybe once you get the simple cases working, and learn from practice what
> it helps and what small extensions to the model you could add to improve on things, you could expand
> on the HTM architected limits in a controlled manner and do incremental improvements.
>
> Instead of the current crazy "let's do lock elision in hardware without knowing what the f*ck we're
> doing, and let's make it so complex that it doesn't actually work" that everybody has done.
Well, the incremental approach is certainly reasonable. But whatever small starting solution you implement, you still have to make sure it is generally useful. If it requires domain-expert-level programmers, or isn't usable from higher level languages, or doesn't support widespread use cases, then - however nice and simple it is - it will be just a toy technology with no wide adoption, and therefore not worth the cost and effort of implementing it. You have to start with something more widely applicable and useful.