By: dmcq (dmcq.delete@this.fano.co.uk), April 7, 2021 3:32 pm
Room: Moderated Discussions
Andrey (andrey.semashev.delete@this.gmail.com) on April 7, 2021 2:32 pm wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on April 7, 2021 1:20 pm wrote:
> >
> > I personally would expect that the most likely actual true fix to the HTM problems is to only allow
> > small transactions in the first place, prove statically
> > that they can complete (kind of like the constrained
> > s390 transactions), and avoid the whole software-visible abort/fallback situation entirely.
> >
> > IOW, make HTM act like any other CPU speculation fault where the
> > hardware will retry and know it can complete it eventually.
> >
> > IOW, take the "RISC approach" to HTM. Make it simple, make it stupid, but
> > make that simple and stupid case go really really fast, exactly because it
> > doesn't handle all the complexities, and doesn't involve a lot of state.
> >
> > Don't do any "register state rollback" garbage: the transactional register state is limited to the normal
> > register renames, and it's basically all done by the existing OoO speculative execution engine.
> >
> > Don't do any "transactional L1 cache contents" garbage.
> > Transactional stores are limited to the store queue,
> > and the cache conflict tracing is done using the existing memory reordering speculation hardware.
> >
> > Don't do any "software retry" garbage: the architected HTM size is guaranteed to make
> > forward progress because it is of limited size (exactly like "ll/sc" sequences).
>
> LL/SC is a very restricted version of HTM. You do have to retry in software if SC fails, although you
> normally don't implement a fallback path hoping that it will eventually succeed. Note that forward progress
> is not guaranteed with LL/SC. That is unlike regular atomics, which makes them superior.
>
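To make the LL/SC point concrete: the software retry really is the whole mechanism there. A minimal sketch in C11 follows (the weak compare-exchange is what compiles down to a load-exclusive/store-exclusive pair on an LL/SC machine, and it is allowed to fail spuriously, hence the loop; fetch_inc is just an invented example name):

    #include <stdatomic.h>

    /* On an LL/SC architecture the weak CAS below maps to a
       load-exclusive/store-exclusive pair and may fail spuriously,
       so the software retry loop is mandatory - there is no separate
       fallback path, only the retry. */
    static int fetch_inc(_Atomic int *p)
    {
        int old = atomic_load_explicit(p, memory_order_relaxed);
        while (!atomic_compare_exchange_weak_explicit(
                   p, &old, old + 1,
                   memory_order_acq_rel, memory_order_relaxed))
            ;   /* 'old' has been refreshed with the current value; retry */
        return old;
    }
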
> I think the "forward progress" arguments that are made throughout this discussion are missing that
> you can't have a forward progress guarantee as long as you have (a) preemption and (b) virtual memory
> and paging. Basically, if your thread can be interrupted at an arbitrary point and that can cause
> an abort, you don't have forward progress guarantee. That doesn't mean that transactions will abort
> often, but it does mean that aborts will happen regardless of the programmer's effort and you have
> to have some sort of a retry and possibly a fallback path that is less likely to fail. And yes,
> as you noted, this requirement is the more emphasized the larger the transaction.
Aborts due to paging certainly will happen, but there is no reason for asynchronous aborts of short code sequences. Such interrupts should be deferred, and vectoring them to a processor dedicated to the task can be used if interrupt latency is absolutely critical.
> Regarding the abort semantics, i.e. whether the memory and register state is restored, I don't think this
> is a crucial part for a particular HTM implementation success. I mean, surely you want at least some
> memory accesses to be rolled back (or simply to not commit) in case of transactional abort, but it doesn't
> have to be all accesses, as in case of TSX. I could imagine a workable model where the particular accesses
> to be monitored and rolled back in case of abort would have to be done with special instructions (something
> like AMD ASF). This model would be less convenient to program, but still acceptable. Whether register
> state is restored is also not very important if there is a way to restore them upon retry.
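Roughly, that selective-monitoring model might look like the sketch below. Every "intrinsic" in it (spec_begin, load_mon, store_mon, spec_commit) is invented purely for illustration, loosely in the spirit of the AMD ASF proposal; none of it is a real compiler API:

    typedef struct node { struct node *next; } node_t;

    /* All four "intrinsics" are hypothetical, made up for this sketch. */
    int     spec_begin(void);                    /* start speculation; returns 0 on abort  */
    node_t *load_mon(node_t **addr);             /* monitored load: conflicts cause abort  */
    void    store_mon(node_t **addr, node_t *v); /* monitored store: buffered until commit */
    void    spec_commit(void);                   /* commit the monitored stores            */

    int push_speculative(node_t **head, node_t *n)
    {
        if (!spec_begin())
            return 0;                 /* aborted: caller retries or falls back to a lock */

        node_t *old = load_mon(head); /* monitored: a conflicting write aborts us        */
        n->next = old;                /* ordinary store: NOT rolled back on abort        */
        store_mon(head, n);           /* monitored: only becomes visible at commit       */

        spec_commit();
        return 1;
    }
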
>
> What is important for a HTM implementation success is (a) the cost of successful path and (b) the probability
> and cost of aborts. With (a) it's pretty clear - the cost must approach the cost of an atomic operation
> (I'm not counting the transaction body here, of course). With (b), you have to accept that aborts will
> happen, and that you have to program a retry and/or fallback path, but the implementation must minimize
> the probability of aborts due to hardware events and likely provide ways to minimize aborts on the software
> side as well. That is, a page fault, an interrupt, a thread preemption - all these must not cause an
> abort, unless the memory conflict actually happened during this event.
I don't think one can avoid having to save some registers, but I don't see that as a major problem.
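On the retry/fallback point above, the shape people end up writing with today's TSX-style interfaces is roughly the following. This is only a sketch using the Intel RTM intrinsics from immintrin.h; the counter and fallback_taken names are mine, and a production version would be more careful about when to retry versus when to give up:

    #include <immintrin.h>   /* _xbegin / _xend / _xabort; compile with -mrtm */
    #include <stdatomic.h>

    static atomic_int fallback_taken;   /* 0 = free, 1 = fallback lock held */
    static long counter;

    void increment(void)
    {
        for (int attempt = 0; attempt < 3; attempt++) {
            if (_xbegin() == _XBEGIN_STARTED) {
                /* Reading the lock word puts it in the read set, so if
                   another thread takes the fallback lock this transaction
                   aborts instead of racing against the locked path. */
                if (atomic_load_explicit(&fallback_taken, memory_order_relaxed))
                    _xabort(0xff);
                counter++;              /* transaction body */
                _xend();                /* commit: all or nothing */
                return;
            }
            /* Aborted: conflict, capacity, interrupt, page fault... retry. */
        }
        /* Fallback path: a plain lock that is guaranteed to make progress. */
        int expected = 0;
        while (!atomic_compare_exchange_weak_explicit(&fallback_taken, &expected, 1,
                                                      memory_order_acquire,
                                                      memory_order_relaxed))
            expected = 0;
        counter++;
        atomic_store_explicit(&fallback_taken, 0, memory_order_release);
    }
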
> > In that situation, a transaction that is bigger than the architected hardware resources would
> > not be an "abort" - it would just be a fatal error, like a divide-by-zero is, and would trap.
> > It would be very expensive indeed, but that would be ok - because there would be no "retry with
> > fallback" for that case. It would have been a programmer error to have generated such an instruction
> > sequence in the first place, exactly like it's a programmer error to divide by zero.
>
> Except that the programmer doesn't generate instructions usually. Unlike divide by zero, the programmer cannot
> ensure that a particular sequence of instructions is generated, unless he writes in assembler. And I doubt
> a technology that basically requires writing in assembler is going to succeed beyond a few niche cases.
I think it should be possible to give some minimum figures for what is supported and run a compiler check against them. If some implementations provide more, programmers would be warned when they rely on more than the standard minimum. It would be much simpler than saying what a GPU supports. I don't think assembler would need to be used; it would only be for specific important optimisations.
> > Personally, I'd much rather see hardware companies try to start from that kind of very targeted
> > (and less ambitious) HTM. Maybe once you get the simple cases working, and learn from practice what
> > it helps and what small extensions to the model you could add to improve on things, you could expand
> > on the HTM architected limits in a controlled manner and do incremental improvements.
> >
> > Instead of the current crazy "let's do lock elision in hardware without knowing what the f*ck we're
> > doing, and let's make it so complex that it doesn't actually work" that everybody has done.
>
> Well, the incremental approach is certainly reasonable. But whatever small starting solution you implement,
> you still have to make sure it is generally useful. If it requires a domain expert level programmers or
> isn't usable in higher level languages or doesn't support widespread use cases then - however nice and
> simple it is - it will be just a toy technology with no wide use, and therefore not worth the cost and
> effort of implementing. You have to start with something more widely applicable and useful.
This whole area requires good programmers and well-checked code. Atomics and locks require some care anyway, and everyone should think twice, three or more times before coming anywhere near lock-free code.