By: --- (---.delete@this.redheron.com), August 21, 2022 9:33 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on August 21, 2022 6:58 pm wrote:
> --- (---.delete@this.redheron.com) on August 21, 2022 1:26 pm wrote:
> > What you lose is the SW ease of HTM,
>
> "SW ease of HTM" never existed.
>
> In fact, I would argue that one of the big failures of HTM was to aim for it being some
> kind of generic thing in the first place. That just made it big, intrusive, and fragile.
>
> Large transactions are a mistake. They are fundamentally fragile. It's like designing for a glass jaw: "this
> will go really well if all the stars align, but if they don't it's going to be expensive as heck".
>
> And the bigger the transaction, the bigger the cost of failure, but also the
> likelihood of failure goes up. A lot. And it goes up the most in the exact
> situation where you do not want it to go up - when there is contention.
>
> So I'd much rather have something small, simple, and predictably high-performance.
>
> IOW, I'd rather get something like perhaps just an extended load-linked / store-conditional model. Very
> much with the hardware guaranteed forward progress that LL/SC has. Just extended to a couple more accesses.
> Take LL/SC and make it LLx/SCx - and keep the forward progress guarantees. In hardware.
>
> And you can only give those forward progress guarantees if you end up just having code
> length limits, and seriously limit the number of accesses. And keep it simple enough that
> hardware people (and software people) can actually believe that you get it right.
>
> For example, forward progress guarantees almost certainly mean that you'd probably have to do
> something like order the accesses by physical address when you repeat, so that you can then actually
> force them to stay in cache until completion without getting into nasty deadlocks.
>
> Don't order things on the first try, probably not on the second, but if you've seen several
> failed sequences in a row with no successes, keep track of the addresses that are used, and
> lock them in cache (within those very strict limits you set), but in some physical address
> order so that you don't get ABBA deadlocks between different cores doing different accesses.
>
> With a small enough N, that should be eminently doable. But "small enough N" is small! We know N=1 works,
> but it's not good enough if you want to check a lock (one cacheline) and do a doubly linked list update (two
> more cachelines) at the same time. But maybe N=4 is big enough to be useful, but small enough that hardware
> can easily still deal with inconvenient ordering requirements and actually give forward progress guarantees.
>
> Also, just to hit N=4, you'd probably need your L1 cache to be at least 4-way associative,
> because you need to be able to guarantee that you can keep those four accesses in
> cache all the time. Even if they all were to hit in the same set.
>
> So a forward progress guarantee is one very basic requirement for it being easy to use, but it also
> So a forward progress guarantee is one very basic requirement for it being easy to use, but it also
> means that the size of the transaction has to be small. Definitely no more "random C code".
>
> Just like LL/SC.
>
> The other "make it actually easy to use for real" requirement is no asynchronous aborts.
> You can have a hard exception on some "hey, you tried to do more than N accesses", but
> that's a software bug, it wouldn't be a "let's retry without using transactions".
>
> IOW, really make it act like LL/SC - an LL doesn't need to save state for recovery, and a SC failure
> doesn't "abort", it doesn't go to some other code sequence, it just writes a value to a register that
> the user can test for "did this sequence of writes complete fully or not", and then just repeat.
>
> And because forward progress is guaranteed, SW doesn't need to have fallback
> code, so it really just becomes that "repeat on failure" loop.
>
> Just like LL/SC.
>
> I'll take something simple and reliable (and reliably fast) over the HTM mess every time.
>
> In the kernel, we already use a very special version of "N=2" by just putting the lock and the data
> structure it protects next to each other, so that we can use a double-wide atomic access to do them
> both in one go. But it requires some very special data structures to do that, and requires that
> the lock has been split out to be a per-data thing. Which is not realistic for most things.
>
> And sure, a compiler could still recognize simple patterns and turn them into LLx/SCx
> sequences. But more likely, it would all be in libraries and language runtimes.
>
> Linus
Just so I understand, Linus, are you saying that
(a) You NEED LL4/SC4 for correctness? That is, if you can trust LL4/SC4 to do what's promised, various constructions that today are very messy/finicky/slow can be replaced with something much simpler
OR
(b) are you saying that something which could speculate across, say, 4 atomics would give you what you wanted? In other words, it's not a question of replacing messy/finicky code, it's purely a question of performance: if speculation were to extend well from one atomic to two nearby atomics to four nearby atomics, each stage would bring a nice performance boost?