Hardware Transactional Memory, the end?

By: anon2 (anon.delete@this.anon.com), August 23, 2022 12:15 am
Room: Moderated Discussions
dmcq (dmcq.delete@this.fano.co.uk) on August 22, 2022 7:14 am wrote:
> anon2 (anon.delete@this.anon.com) on August 21, 2022 9:31 pm wrote:
> > rwessel (rwessel.delete@this.yahoo.com) on August 21, 2022 8:02 pm wrote:
> > > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on August 21, 2022 6:58 pm wrote:
> > > > --- (---.delete@this.redheron.com) on August 21, 2022 1:26 pm wrote:
> > > > > What you lose is the SW ease of HTM,
> > > >
> > > > "SW ease of HTM" never existed.
> > > >
> > > > In fact, I would argue that one of the big failures of HTM was to aim for it being some
> > > > kind of generic thing in the first place. That just made it big, intrusive, and fragile.
> > > >
> > > > Large transactions are a mistake. They are fundamentally
> > > > fragile. It's like designing for a glass jaw: "this
> > > > will go really well if all the stars align, but if they don't it's going to be expensive as heck".
> > > >
> > > > And the bigger the transaction, the bigger the cost of failure, but also the
> > > > likelihood of failure goes up. A lot. And it goes up the most in the exact
> > > > situation where you do not want it to go up - when there is contention.
> > > >
> > > > So I'd much rather have something small, simple, and predictably high-performance.
> > > >
> > > > IOW, I'd rather get something like perhaps just an extended load-linked / store-conditional model. Very
> > > > much with the hardware guaranteed forward progress that LL/SC has. Just extended to a couple more accesses.
> > > > Take LL/SC and make it LLx/SCx - and keep the forward progress guarantees. In hardware.
> > > >
> > > > And you can only give those forward progress guarantees if you end up just having code
> > > > length limits, and seriously limit the number of accesses. And keep it simple enough that
> > > > hardware people (and software people) can actually believe that you get it right.
> > > >
> > > > For example, forward progress guarantees almost certainly mean that you'd probably have to do
> > > > something like order the accesses by physical address when you repeat, so that you can then actually
> > > > force them to stay in cache until completion without getting into nasty deadlocks.
> > > >
> > > > Don't order things on the first try, probably not on the second, but if you've seen several
> > > > failed sequences in a row with no successes, keep track of the addresses that are used, and
> > > > lock them in cache (within those very strict limits you set), but in some physical address
> > > > order so that you don't get ABBA deadlocks between different cores doing different accesses.
> > > >
> > > > With a small enough N, that should be eminently doable. But "small enough N" is small! We know N=1 works,
> > > > but it's not good enough if you want to check a lock (one
> > > > cacheline) and do a doubly linked list update (two
> > > > more cachelines) at the same time. But maybe N=4 is big enough to be useful, but small enough that hardware
> > > > can easily still deal with inconvenient ordering requirements and actually give forward progress guarantees.
> > > >
> > > > Also, just to hit N=4, you'd probably need your L1 cache to be at least 4-way associative,
> > > > because you need to be able to guarantee that you can keep those four accesses in
> > > > cache all the time. Even if they all were to hit in the same set.
> > > >
> > > > So forward progress guarantees is one very basic requirement for it being easy to use, but it also
> > > > means that the size of the transaction has to be small. Definitely no more "random C code".
> > > >
> > > > Just like LL/SC.
> > > >
> > > > The other "make it actually easy to use for real" requirement is no asynchronous aborts.
> > > > You can have a hard exception on some "hey, you tried to do more than N accesses", but
> > > > that's a software bug, it wouldn't be a "let's retry without using transactions".
> > > >
> > > > IOW, really make it act like LL/SC - an LL doesn't need to save state for recovery, and a SC failure
> > > > doesn't "abort", it doesn't go to some other code sequence, it just writes a value to a register that
> > > > the user can test for "did this sequence of writes complete fully or not", and then just repeat.
> > > >
> > > > And because forward progress is guaranteed, SW doesn't need to have fallback
> > > > code, so it really just becomes that "repeat on failure" loop.
> > > >
> > > > Just like LL/SC.
> > > >
> > > > I'll take something simple and reliable (and reliably fast) over the HTM mess every time.
> > > >
> > > > In the kernel, we already use a very special version of "N=2" by just putting the lock and the data
> > > > structure it protects next to each other, so that we can use a double-wide atomic access to do them
> > > > both in one go. But it requires some very special data structures to do that, and requires that
> > > > the lock has been split out to be a per-data thing. Which is not realistic for most things.
> > > >
> > > > And sure, a compiler could still recognize simple patterns and turn them into LLx/SCx
> > > > sequences. But more likely, it would all be in libraries and language runtimes.
> > >
> > >
> > > Constrained transactions on Z do a lot of that. They guarantee eventual completion, with the processor
> > > doing things as necessary to boost the chances of success on retries. The size of constrained transactions
> > > is more limited (no more than 32 instructions, data accesses to no more than four cache lines), and
> > > some additional instructions are restricted, as are operands. But with some limitations, the transaction
> > > can just be retried until it completes, and no fallback code is required. The retry is pretty simple
> > > too, basically branch back to the TBEGINC based on the condition code.
> >
> > And those are being removed too.
> >
> > I think the reality is that the memory pipeline and cache
> > coherency is a much more difficult problem than software
> > developers understand. "We can take two nested locks so you should be able to atomically push stores to two
> > cache lines, just use our unfathomably intelligent technique that nobody else ever would have thought about
> > of using cache line address to sort the arbitration ordering" probably doesn't quite cut the mustard.
>
> However difficult it is, I'm sure it's less of a minefield than the thousands of sharp knives
> tipped with agonizing poison that is lock-free programming.

Maybe, maybe not. Some lock-free programming is very simple, and a lot of it need not be overly complex. Look at the amount of RCU usage and memory barriers in the Linux kernel, for example. Yes, some of those are fiendishly complicated, but most are pretty simple canned list or hash lookups spread throughout the outer reaches of driver code that thousands of people work on and that don't get a huge amount of attention, and yet Linux has not collapsed in a heap of unmaintainability.

With a memory pipeline and cache coherency protocol... well, even making atomic operations on a single cache line work and scale is difficult enough. Adding more lines is not just more of the same; there are likely completely new constraints and difficulties. Particularly on a CPU like Z, where crashing or corrupting data is practically a non-option. Apparently stores are not fully released until they're committed to redundant DRAM, and the CPU can basically stop at any point and have its checkpointed state moved and restarted on a different CPU, all without the SMP bus timing out so long that other CPUs crash...

Clearly they can do it, because they did. It's also clear that it's a big effort, not just a matter of writing a bit of logic and forgetting about it: it likely requires highly complicated consideration and verification whenever any logic involving memory and coherency changes. I don't think Z would choose to remove the feature lightly unless it was a large burden, even if it had not proven to be highly useful.