By: dmcq (dmcq.delete@this.fano.co.uk), August 24, 2022 3:50 am
Room: Moderated Discussions
anon2 (anon.delete@this.anon.com) on August 23, 2022 12:15 am wrote:
> dmcq (dmcq.delete@this.fano.co.uk) on August 22, 2022 7:14 am wrote:
> > anon2 (anon.delete@this.anon.com) on August 21, 2022 9:31 pm wrote:
> > > rwessel (rwessel.delete@this.yahoo.com) on August 21, 2022 8:02 pm wrote:
> > > > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on August 21, 2022 6:58 pm wrote:
> > > > > --- (---.delete@this.redheron.com) on August 21, 2022 1:26 pm wrote:
> > > > > > What you lose is the SW ease of HTM,
> > > > >
> > > > > "SW ease of HTM" never existed.
> > > > >
> > > > > In fact, I would argue that one of the big failures of HTM was to aim for it being some
> > > > > kind of generic thing in the first place. That just made it big, intrusive, and fragile.
> > > > >
> > > > > Large transactions are a mistake. They are fundamentally
> > > > > fragile. It's like designing for a glass jaw: "this
> > > > > will go really well if all the stars align, but if they don't it's going to be expensive as heck".
> > > > >
> > > > > And the bigger the transaction, the bigger the cost of failure, but also the
> > > > > likelihood of failure goes up. A lot. And it goes up the most in the exact
> > > > > situation where you do not want it to go up - when there is contention.
> > > > >
> > > > > So I'd much rather have something small, simple, and predictably high-performance.
> > > > >
> > > > > IOW, I'd rather get something like perhaps just an extended load-linked / store-conditional model. Very
> > > > > much with the hardware guaranteed forward progress that LL/SC has. Just extended to a couple more accesses.
> > > > > Take LL/SC and make it LLx/SCx - and keep the forward progress guarantees. In hardware.
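
Just to make the shape concrete (my sketch, not from the post above): a plain single-address LL/SC retry loop on AArch64, via GCC inline asm. The LLx/SCx idea would look exactly the same, only with more than one linked address:

    #include <stdint.h>

    /* Atomic add via the classic LL/SC retry loop. On SC failure the
       hardware just writes a status flag; software loops - no fallback. */
    static inline uint64_t llsc_add(uint64_t *p, uint64_t n)
    {
        uint64_t old, tmp;
        uint32_t fail;
        do {
            __asm__ volatile(
                "ldxr %0, [%3]\n\t"    /* load-linked */
                "add  %1, %0, %4\n\t"
                "stxr %w2, %1, [%3]"   /* store-conditional: %w2 == 0 on success */
                : "=&r"(old), "=&r"(tmp), "=&r"(fail)
                : "r"(p), "r"(n)
                : "memory");
        } while (fail);                /* "did it complete?" - if not, repeat */
        return old;
    }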
> > > > >
> > > > > And you can only give those forward progress guarantees if you end up just having code
> > > > > length limits, and seriously limit the number of accesses. And keep it simple enough that
> > > > > hardware people (and software people) can actually believe that you get it right.
> > > > >
> > > > > For example, forward progress guarantees almost certainly mean that you'd probably have to do
> > > > > something like order the accesses by physical address when you repeat, so that you can then actually
> > > > > force them to stay in cache until completion without getting into nasty deadlocks.
> > > > >
> > > > > Don't order things on the first try, probably not on the second, but if you've seen several
> > > > > failed sequences in a row with no successes, keep track of the addresses that are used, and
> > > > > lock them in cache (within those very strict limits you set), but in some physical address
> > > > > order so that you don't get ABBA deadlocks between different cores doing different accesses.
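
The software analogue of that ordering rule is the old recipe for taking two locks without ABBA deadlock: always acquire in one fixed global order, e.g. by address. A minimal userspace sketch (the pthread names are real, lock_pair is made up):

    #include <stdint.h>
    #include <pthread.h>

    /* Take two locks in a fixed global (address) order, so two threads
       grabbing the same pair in opposite order can't deadlock ABBA-style. */
    static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
    {
        if ((uintptr_t)a > (uintptr_t)b) {
            pthread_mutex_t *t = a; a = b; b = t;   /* sort by address */
        }
        pthread_mutex_lock(a);
        pthread_mutex_lock(b);   /* every thread takes them in the same order */
    }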
> > > > >
> > > > > With a small enough N, that should be eminently doable. But "small enough N" is small! We know N=1 works,
> > > > > but it's not good enough if you want to check a lock (one
> > > > > cacheline) and do a doubly linked list update (two
> > > > > more cachelines) at the same time. But maybe N=4 is big enough to be useful, but small enough that hardware
> > > > > can easily still deal with inconvenient ordering requirements and actually give forward progress guarantees.
> > > > >
> > > > > Also, just to hit N=4, you'd probably need your L1 cache to be at least 4-way associative,
> > > > > because you need to be able to guarantee that you can keep those four accesses in
> > > > > cache all the time. Even if they all were to hit in the same set.
> > > > >
> > > > > So the forward progress guarantee is one very basic requirement for it being easy to use, but it also
> > > > > means that the size of the transaction has to be small. Definitely no more "random C code".
> > > > >
> > > > > Just like LL/SC.
> > > > >
> > > > > The other "make it actually easy to use for real" requirement is no asynchronous aborts.
> > > > > You can have a hard exception on some "hey, you tried to do more than N accesses", but
> > > > > that's a software bug, it wouldn't be a "let's retry without using transactions".
> > > > >
> > > > > IOW, really make it act like LL/SC - an LL doesn't need to save state for recovery, and a SC failure
> > > > > doesn't "abort", it doesn't go to some other code sequence, it just writes a value to a register that
> > > > > the user can test for "did this sequence of writes complete fully or not", and then just repeat.
> > > > >
> > > > > And because forward progress is guaranteed, SW doesn't need to have fallback
> > > > > code, so it really just becomes that "repeat on failure" loop.
> > > > >
> > > > > Just like LL/SC.
> > > > >
> > > > > I'll take something simple and reliable (and reliably fast) over the HTM mess every time.
> > > > >
> > > > > In the kernel, we already use a very special version of "N=2" by just putting the lock and the data
> > > > > structure it protects next to each other, so that we can use a double-wide atomic access to do them
> > > > > both in one go. But it requires some very special data structures to do that, and requires that
> > > > > the lock has been split out to be a per-data thing. Which is not realistic for most things.
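
In userspace terms that trick looks roughly like this (illustrative sketch only; field and function names invented, the real kernel code is more subtle). On x86-64 the builtin compiles to cmpxchg16b when built with -mcx16:

    #include <stdint.h>
    #include <stdbool.h>

    /* Lock word and the datum it guards packed into one 16-byte unit,
       so a single double-wide CAS covers both: the update only succeeds
       if the lock is free and the data is unchanged. */
    struct lock_and_data {
        uint64_t lock;    /* 0 = unlocked */
        uint64_t data;
    } __attribute__((aligned(16)));

    static bool try_update(struct lock_and_data *p,
                           uint64_t oldval, uint64_t newval)
    {
        struct lock_and_data expected = { .lock = 0, .data = oldval };
        struct lock_and_data desired  = { .lock = 0, .data = newval };
        return __atomic_compare_exchange(p, &expected, &desired, false,
                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    }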
> > > > >
> > > > > And sure, a compiler could still recognize simple patterns and turn them into LLx/SCx
> > > > > sequences. But more likely, it would all be in libraries and language runtimes.
> > > >
> > > >
> > > > Constrained transactions on Z do a lot of that. They guarantee eventual completion, with the processor
> > > > doing things as necessary to boost the chances of success on retries. The size of constrained transactions
> > > > is more limited (no more than 32 instructions, data accesses to no more than four cache lines), and
> > > > some additional instructions are restricted, as are operands. But with some limitations, the transaction
> > > > can just be retried until it completes, and no fallback code is required. The retry is pretty simple
> > > > too, basically branch back to the TBEGINC based on the condition code.
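
With GCC's s390-specific builtins (compiled with -mhtm; a sketch, untested on real hardware) the whole thing would be about this small:

    /* z constrained transaction: both stores commit atomically. On abort,
       execution resumes at the TBEGINC, so the transaction simply retries
       until it commits - no fallback path in software. */
    static void store_two_lines(long *a, long *b, long va, long vb)
    {
        __builtin_tbeginc();   /* TBEGINC - constrained begin */
        *a = va;               /* at most 4 cache lines touched, */
        *b = vb;               /* at most 32 instructions        */
        __builtin_tend();      /* TEND - commit */
    }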
> > >
> > > And those are being removed too.
> > >
> > > I think the reality is that the memory pipeline and cache
> > > coherency is a much more difficult problem than software
> > > developers understand. "We can take two nested locks, so you should be able to atomically push stores to two
> > > cache lines; just use our unfathomably intelligent technique that nobody else would ever have thought of,
> > > sorting the arbitration order by cache line address" probably doesn't quite cut the mustard.
> >
> > However difficult it is, I'm sure it's less of a minefield than the thousands of sharp knives
> > tipped with agonizing poison that is lock-free programming.
>
> Maybe, maybe not. Some lock-free programming is very simple, and a lot of it should not be overly complex.
> Look at the amount of RCU usage and memory barriers in the Linux kernel, for example. Yes, some of those
> are fiendishly complicated, but most are pretty simple canned list or hash lookups spread throughout
> the outer reaches of driver code that thousands of people work on; they don't get a huge amount
> of attention, and Linux has not collapsed in a heap of unmaintainability.
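
The "canned" case is something like this kernel-style sketch (struct and function names made up; the rcu_* and list_*_rcu helpers are the real kernel API from include/linux/rculist.h):

    /* Typical read-side RCU lookup: no locks and no explicit barriers in
       sight - they are all buried inside the rcu_* / _rcu helpers. */
    struct item {
        int key, val;
        struct list_head node;
    };

    static bool lookup(struct list_head *head, int key, int *val_out)
    {
        struct item *it;
        bool found = false;

        rcu_read_lock();
        list_for_each_entry_rcu(it, head, node) {
            if (it->key == key) {
                *val_out = it->val;   /* copy out while still protected */
                found = true;
                break;
            }
        }
        rcu_read_unlock();
        return found;
    }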
>
> With a memory pipeline and cache coherency protocol... well, even making it work and scale with atomic operations
> on only a single cache line is difficult enough. Adding more lines is not just more of the same, there's likely
> completely new constraints and difficulties. Particularly on a CPU like Z where crashing or corrupting data
> is practically a non-option. Apparently they don't completely release stores until they're committed to redundant
> DRAM, and the CPU can basically stop at any point and have its checkpointed state moved and restarted on a
> different CPU. All without stalling the SMP bus for so long that other CPUs time out and crash...
>
> Clearly they can do it, because they did. Also clear that it's a big effort that's not just a matter of
> writing a bit of logic and forgetting about it, but is likely to require highly complicated consideration
> and verification when changing any logic involving memory and coherency. I don't think Z would choose to remove
> the feature lightly unless it was a large burden, even if it had not proven to be highly useful.

You have a good point there. I can't say I'm exactly enthused by the memory barriers implicit in reading or writing an address in RCU, but I guess Linus would go on about the Alpha :-) Anyway, I guess load-acquire and store-release would fix that if it really caused problems.
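
In C11 terms that's something like this minimal sketch:

    #include <stdatomic.h>

    struct obj { int payload; };
    static _Atomic(struct obj *) shared;   /* published pointer */

    /* Publisher: initialise the object fully, then store-release it. */
    void publish(struct obj *o)
    {
        o->payload = 42;
        atomic_store_explicit(&shared, o, memory_order_release);
    }

    /* Reader: the load-acquire guarantees the payload writes are seen,
       with no separate barrier - even on Alpha-style memory models. */
    int consume(void)
    {
        struct obj *o = atomic_load_explicit(&shared, memory_order_acquire);
        return o ? o->payload : -1;
    }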
The error detection and correction in the IBM mainframes is quite awesome!