By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), August 25, 2022 10:16 am
Room: Moderated Discussions
rwessel (rwessel.delete@this.yahoo.com) on August 25, 2022 8:16 am wrote:
>
> I'm not sure I understand this concern. The transaction appears (eventually) to all other cores
> as a single atomic event. In practice another transaction (or a non-transactional access) can interfere,
> and prevent completion, but they can't cause a deadlock.
Sure, they can cause a deadlock (or more commonly it's really a livelock) if you do a simplistic implementation.
In fact, you can very much get that livelock even with just a single cacheline with 'LL/SC', if your cache coherency isn't properly designed, so that the 'SC' always fails because other CPUs doing their own LL/SC loops keep stealing the cacheline back before the 'SC' has succeeded.
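To make that concrete, this is the shape of loop we're talking about - a rough sketch in AArch64 inline asm (illustrative only, not lifted from any actual kernel code):

    #include <stdint.h>

    /* LL/SC atomic add. If another CPU steals the cacheline between the
     * LDXR and the STXR, the exclusive monitor is lost, the STXR reports
     * failure, and we go around again. Under enough contention that can
     * happen every single time - i.e. no forward progress at all. */
    static inline void llsc_add(uint64_t *p, uint64_t v)
    {
            uint64_t tmp;
            uint32_t fail;

            do {
                    asm volatile(
                            "ldxr   %0, [%2]\n\t"   /* load-exclusive: claim the line */
                            "add    %0, %0, %3\n\t"
                            "stxr   %w1, %0, [%2]"  /* store-exclusive: 0 = success */
                            : "=&r" (tmp), "=&r" (fail)
                            : "r" (p), "r" (v)
                            : "memory");
            } while (fail);
    }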
And no, that's not some theoretical worry. We literally saw a few of the (not great) early ARM server core designs with CPU starvation in LL/SC loops when we had atomics sharing cachelines with some very hot other data that lots of other CPUs would hammer on with regular stores.
Imagine having the NMI watchdog fire and the machine reboot because LL/SC doesn't make sufficient progress when others can saturate your cache coherency traffic from user space, because the chip was designed to be some flock-of-chickens thing with lots of cores but not the infrastructure to match.
And that's with just a single cacheline needed to complete.
When you have a transaction with multiple cachelines, you can have CPU1 want to load cachelines exclusively in order A->B, and another one want to load them in order B->A.
Yes, both CPUs do that in order to expose one atomic sequence, so the ordering isn't visible in the end result, but in order to get that atomic sequence both CPUs want both cachelines to be exclusively in their caches.
And if the cache protocol isn't well enough designed, they can get into a deadlock where one CPU has one cacheline and wants the other, and the other CPU has that second cacheline but wants the first. And neither wants to release the one it has, because it needs it for the transaction to complete.
More commonly, you don't get a deadlock, but you get the above kind of livelock, where both CPUs hold on to the cacheline they have for a while, but then abort and retry - and re-create the same issue. Possibly forever, or possibly just for a long time, until some other action ends up perturbing the timing enough that one CPU finally gets both and they get to continue.
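The software analogy is the classic ABBA lock-ordering deadlock - here's a sketch with plain pthreads, where the two mutexes stand in for exclusive ownership of cachelines A and B (hypothetical code, purely to show the shape of the problem):

    #include <pthread.h>

    static pthread_mutex_t line_a = PTHREAD_MUTEX_INITIALIZER; /* "cacheline A" */
    static pthread_mutex_t line_b = PTHREAD_MUTEX_INITIALIZER; /* "cacheline B" */

    /* CPU1's "transaction": wants A, then B */
    static void *cpu1(void *arg)
    {
            pthread_mutex_lock(&line_a);
            pthread_mutex_lock(&line_b);    /* blocks forever if cpu2 already holds B */
            /* ... update both ... */
            pthread_mutex_unlock(&line_b);
            pthread_mutex_unlock(&line_a);
            return 0;
    }

    /* CPU2's "transaction": wants B, then A - the opposite order */
    static void *cpu2(void *arg)
    {
            pthread_mutex_lock(&line_b);
            pthread_mutex_lock(&line_a);    /* blocks forever if cpu1 already holds A */
            /* ... update both ... */
            pthread_mutex_unlock(&line_a);
            pthread_mutex_unlock(&line_b);
            return 0;
    }

The difference in the transactional case is just that instead of blocking, each side eventually aborts and retries - which is how you trade the hard deadlock for the livelock described above.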
And before you say it, yes, the solution is trivial. With only a small set of cachelines, just ordering them by some shared ordering (typically physical address, but it could be anything else) will fix it. And sorting things really is simple when "things" is just a couple of elements.
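In the same pthread analogy, the fix is literally this (again just a sketch):

    #include <stdint.h>
    #include <pthread.h>

    /* Always take the two "cachelines" in one global order - here by
     * address - so the ABBA pattern above cannot happen in the first
     * place. */
    static void acquire_both(pthread_mutex_t *a, pthread_mutex_t *b)
    {
            if ((uintptr_t)a > (uintptr_t)b) {
                    pthread_mutex_t *tmp = a;
                    a = b;
                    b = tmp;
            }
            pthread_mutex_lock(a);
            pthread_mutex_lock(b);
    }

The hardware equivalent would order by physical address (or whatever shared ordering the implementation picks), but the principle is the same.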
Of course, even that "trivial" thing is then complicated by the fact that it affects almost the whole memory pipeline - the atomic sequence itself will be handled by the core in the L1 cache, but now that it requires ordering on the bus, you really need to have the logic permeate all the way out...
And considering that people have gotten cache coherency wrong even when N - the number of cachelines involved - is just one, it's most certainly understandable that people would worry about this. And that memory pipeline is most definitely the thing people worry about most, even in the absence of any of these kinds of issues.
Side note: you can most definitely get the same kinds of things with a single memory access: x86 allows atomic locked instructions that cross cacheline - or even page - boundaries, and their CPU designers hate the complexity that alone causes (Intel calls these "split locks").
It gets even worse when one (or both) of those sides is not regular RAM and doesn't even have a cacheline, so the whole "keep the cacheline around" approach - making it look atomic even if it wasn't really atomic internally - doesn't work.
Intel historically solved that with a single global "lock pin", which is horrendous from a performance and scalability standpoint. And while I'm sure it is solvable in other ways, Intel is actually trying to make locked cacheline crossers fault, so that you can find software that does this (hopefully unintentionally) and just fix it.
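For reference, this is the kind of thing that split-lock detection is meant to catch - a sketch (GCC/Clang on x86-64; the struct layout is contrived purely to force a 64-bit value across a 64-byte cacheline boundary):

    #include <stdint.h>

    /* 'val' sits at offset 62 of a 64-byte-aligned struct, so its eight
     * bytes straddle the cacheline boundary at byte 64. */
    struct split {
            char pad[62];
            int64_t val;
    } __attribute__((packed, aligned(64)));

    static struct split s;

    void hammer(void)
    {
            /* Compiles to a LOCK-prefixed read-modify-write. Because the
             * operand crosses a cacheline boundary, the CPU can't make it
             * atomic just by holding on to one line - that's a split lock. */
            __sync_fetch_and_add(&s.val, 1);
    }

Depending on how the machine is configured, that access is either merely horribly slow, or (with split-lock detection enabled) it gets you a warning or a fault so you can find and fix the offending code.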
The good news is that people used to have so many problems with subtle cache consistency bugs that these days I think all CPU designers basically use automated proofs of correctness for the cache protocol. Then you still rely on the cache consistency model you proved correct actually matching what the hardware does, but hey, that's a different issue.
Linus