By: dmcq (dmcq.delete@this.fano.co.uk), August 22, 2022 7:14 am
Room: Moderated Discussions
anon2 (anon.delete@this.anon.com) on August 21, 2022 9:31 pm wrote:
> rwessel (rwessel.delete@this.yahoo.com) on August 21, 2022 8:02 pm wrote:
> > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on August 21, 2022 6:58 pm wrote:
> > > --- (---.delete@this.redheron.com) on August 21, 2022 1:26 pm wrote:
> > > > What you lose is the SW ease of HTM,
> > >
> > > "SW ease of HTM" never existed.
> > >
> > > In fact, I would argue that one of the big failures of HTM was to aim for it being some
> > > kind of generic thing in the first place. That just made it big, intrusive, and fragile.
> > >
> > > Large transactions are a mistake. They are fundamentally
> > > fragile. It's like designing for a glass jaw: "this
> > > will go really well if all the stars align, but if they don't it's going to be expensive as heck".
> > >
> > > And the bigger the transaction, the bigger the cost of failure, but also the
> > > likelihood of failure goes up. A lot. And it goes up the most in the exact
> > > situation where you do not want it to go up - when there is contention.
> > >
> > > So I'd much rather have something small, simple, and predictably high-performance.
> > >
> > > IOW, I'd rather get something like perhaps just an extended load-linked / store-conditional model. Very
> > > much with the hardware guaranteed forward progress that LL/SC has. Just extended to a couple more accesses.
> > > Take LL/SC and make it LLx/SCx - and keep the forward progress guarantees. In hardware.
> > >
> > > And you can only give those forward progress guarantees if you end up just having code
> > > length limits, and seriously limit the number of accesses. And keep it simple enough that
> > > hardware people (and software people) can actually believe that you get it right.
> > >
> > > For example, forward progress guarantees almost certainly mean that you'd probably have to do
> > > something like order the accesses by physical address when you repeat, so that you can then actually
> > > force them to stay in cache until completion without getting into nasty deadlocks.
> > >
> > > Don't order things on the first try, probably not on the second, but if you've seen several
> > > failed sequences in a row with no successes, keep track of the addresses that are used, and
> > > lock them in cache (within those very strict limits you set), but in some physical address
> > > order so that you don't get ABBA deadlocks between different cores doing different accesses.
> > >
> > > With a small enough N, that should be eminently doable. But "small enough N" is small! We know N=1 works,
> > > but it's not good enough if you want to check a lock (one
> > > cacheline) and do a doubly linked list update (two
> > > more cachelines) at the same time. But maybe N=4 is big enough to be useful, but small enough that hardware
> > > can easily still deal with inconvenient ordering requirements and actually give forward progress guarantees.
> > >
> > > Also, just to hit N=4, you'd probably need your L1 cache to be at least 4-way associative,
> > > because you need to be able to guarantee that you can keep those four accesses in
> > > cache all the time. Even if they all were to hit in the same set.
> > >
> > > So forward progress guarantees is one very basic requirement for it being easy to use, but it also
> > > means that the size of the transaction has to be small. Definitely no more "random C code".
> > >
> > > Just like LL/SC.
> > >
> > > The other "make it actually easy to use for real" requirement is no asynchronous aborts.
> > > You can have a hard exception on some "hey, you tried to do more than N accesses", but
> > > that's a software bug, it wouldn't be a "let's retry without using transactions".
> > >
> > > IOW, really make it act like LL/SC - an LL doesn't need to save state for recovery, and a SC failure
> > > doesn't "abort", it doesn't go to some other code sequence, it just writes a value to a register that
> > > the user can test for "did this sequence of writes complete fully or not", and then just repeat.
> > >
> > > And because forward progress is guaranteed, SW doesn't need to have fallback
> > > code, so it really just becomes that "repeat on failure" loop.
> > >
> > > Just like LL/SC.
> > >
> > > I'll take something simple and reliable (and reliably fast) over the HTM mess every time.
> > >
> > > In the kernel, we already use a very special version of "N=2" by just putting the lock and the data
> > > structure it protects next to each other, so that we can use a double-wide atomic access to do them
> > > both in one go. But it requires some very special data structures to do that, and requires that
> > > the lock has been split out to be a per-data thing. Which is not realistic for most things.
> > >
> > > And sure, a compiler could still recognize simple patterns and turn them into LLx/SCx
> > > sequences. But more likely, it would all be in libraries and language runtimes.
> >
> >
> > Constrained transactions on Z do a lot of that. They guarantee eventual completion, with the processor
> > doing things as necessary to boost the chances of success on retries. The size of constrained transactions
> > is more limited (no more than 32 instructions, data accesses to no more than four cache lines), and
> > some additional instructions are restricted, as are operands. But with some limitations, the transaction
> > can just be retried until it completes, and no fallback code is required. The retry is pretty simple
> > too, basically branch back to the TBEGINC based on the condition code.
>
> And those are being removed too.
>
> I think the reality is that the memory pipeline and cache coherency is a much more difficult problem than software
> developers understand. "We can take two nested locks so you should be able to atomically push stores to two
> cache lines, just use our unfathomably intelligent technique that nobody else would ever have thought
> of, using cache line addresses to sort the arbitration ordering" probably doesn't quite cut the mustard.
However difficult it is, I'm sure it's less of a minefield than the thousands of sharp knives tipped with agonizing poison that is lock-free programming. I'd have thought a constrained transaction with four data locations could do most anything one could do with that.