By: rwessel (rwessel.delete@this.yahoo.com), August 25, 2022 11:00 am
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on August 25, 2022 10:16 am wrote:
> rwessel (rwessel.delete@this.yahoo.com) on August 25, 2022 8:16 am wrote:
> >
> > I'm not sure I understand this concern. The transaction appears (eventually) to all other cores
> > as a single atomic event. In practice another transaction (or a non-transactional access) can interfere,
> > and prevent completion, but they can't cause a deadlock.
>
> Sure, they can cause a deadlock (or more commonly it's really
> a livelock) if you do a simplistic implementation.
>
> In fact, you can very much get that livelock even with just a single cacheline with 'LL/SC', if
> your cache coherency isn't properly designed, so that the 'SC' always fails because other CPU's
> doing their own LL/SC loops always steal the cacheline back before the SC has succeeded.
>
> And no, that's not some theoretical worry. We literally saw a few of the (not great) early ARM
> server core designs with CPU starvation in LL/SC loops when we had atomics share cachelines with
> some very hot other data where lots of other CPU's would hammer on it with regular stores.
>
> Imagine having the NMI watchdog fire and the machine reboot because LL/SC doesn't make sufficient progress
> because others can saturate your cache coherency traffic in user space, and the chip was designed to
> be some flock-of-chickens thing with lots of cores but not the infrastructure to match?
>
> And that's with just a single cacheline needed to complete.
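For reference, the kind of loop being described -- a minimal sketch in C11 atomics rather than raw LL/SC, on the assumption that a pre-LSE ARM core lowers the weak compare-exchange below to an LDXR/STXR pair:

    #include <stdatomic.h>

    /* Increment *p atomically.  On an older ARM core (no LSE atomics) the
     * weak compare-exchange becomes an LDXR/STXR pair, and this loop spins
     * whenever the store-exclusive fails because another CPU grabbed the
     * cacheline back in between -- the starvation scenario above. */
    void add_one(_Atomic int *p)
    {
        int old = atomic_load_explicit(p, memory_order_relaxed);
        while (!atomic_compare_exchange_weak_explicit(p, &old, old + 1,
                                                      memory_order_relaxed,
                                                      memory_order_relaxed))
            ;   /* 'old' was reloaded on failure; just try again */
    }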
>
> When you have a transaction with multiple cachelines, you can have CPU1 want to load cachelines
> exclusively in order A->B, and another one want to load them in order B->A.
>
> Yes, both CPU's do that in order to expose one atomic sequence, so the ordering
> isn't visible from the end result, but in order to get that atomic sequence
> both CPU's want both cachelines to be exclusively in their caches.
>
> And if the cache protocol isn't well enough designed, they will either get into a deadlock where one
> CPU has one cacheline and wants the other, and the other CPU has that second cacheline but wants the
> first. And neither wants to release it because they need it for the transaction to complete.
>
> More commonly, you don't get a deadlock, but you get the above kind of livelock, where both CPU's
> hold on to the cacheline they have for a while, but then abort and retry - and re-create the
> same issue. Possibly forever, or possibly just for a long time until some other action ends up
> perturbing the timing enough that one CPU finally gets both, and they get to continue.
>
> And before you say it, yes, the solution is trivial. With only a small set of cachelines, just
> ordering them by some shared ordering (typically physical address, but it could be anything else)
> will fix it. And sorting things really is simple when "things" is just a couple of elements.
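The software analogue of that ordering fix is the usual two-lock discipline -- a sketch with a made-up 'struct account', using plain pthread mutexes to stand in for the exclusively-held cachelines:

    #include <pthread.h>

    struct account { pthread_mutex_t lock; long balance; };

    /* Both locks are always taken lowest-address-first, so two CPUs doing
     * transfer(a, b, ...) and transfer(b, a, ...) can never each hold one
     * lock while waiting for the other. */
    void transfer(struct account *from, struct account *to, long amount)
    {
        pthread_mutex_t *first  = (from < to) ? &from->lock : &to->lock;
        pthread_mutex_t *second = (from < to) ? &to->lock   : &from->lock;

        pthread_mutex_lock(first);
        pthread_mutex_lock(second);
        from->balance -= amount;
        to->balance   += amount;
        pthread_mutex_unlock(second);
        pthread_mutex_unlock(first);
    }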
>
> Of course, even that "trivial" thing is then complicated by the fact that it affects almost the whole
> memory pipeline - the atomic sequence itself will be handled by the core in the L1 cache, but now that
> it requires ordering on the bus you need to really have the logic permeate all the way out...
>
> And considering that people have gotten cache coherency wrong even when N is just one, it's most
> certainly understandable that people would worry about this. And that memory pipeline is most definitely
> the thing most people worry most about even in the absence of any of these kinds of issues.
>
> Side note: you can most definitely get the same kinds of things with a single memory access:
> x86 allows atomic locked instructions that cross cache line - or even page - boundaries, and
> their CPU designers hate the complexity that just that causes (Intel calls them "split locks").
>
> It gets even worse when one (or both) of those sides is not regular RAM and doesn't
> even have a cacheline, so the whole "keep cacheline around" approach to make
> it look atomic even if it wasn't really atomic internally doesn't work.
>
> Intel historically solved that with a single global "lock pin", and it's horrendous from a performance and scalability
> standpoint. And while I'm sure it is solvable other ways, Intel actually is trying to make locked cacheline crossers
> fault so that you can find software that does this (hopefully unintentionally) and just fix it.
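Creating one of those cacheline crossers from C is depressingly easy -- a contrived sketch, assuming GCC/Clang attributes and a 64-byte line:

    #include <stdint.h>

    /* 64-byte-aligned struct whose counter sits at offset 62, so the
     * 4-byte field straddles two cachelines. */
    struct __attribute__((packed, aligned(64))) bad_layout {
        char     pad[62];
        uint32_t counter;       /* bytes 62..65: crosses the line boundary */
    };

    void bump(struct bad_layout *p)
    {
        /* On x86 this becomes a LOCK-prefixed add on a misaligned address,
         * i.e. a split lock (the compiler will warn about the unaligned
         * pointer, which is rather the point). */
        __atomic_fetch_add(&p->counter, 1, __ATOMIC_RELAXED);
    }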
>
> The good news is that people used to have so many problems with subtle cache consistency bugs
> that these days I think all CPU designers will basically use automated proofs of correctness
> for the cache protocol. Then you still rely on your cache consistency model that you proved
> to be correct actually matching what the hardware does, but hey, that's a different issue.
My point was that there's nothing inherent in the concept of a transaction that's guaranteed to succeed* eventually that requires the possibility of a deadlock (or livelock). You can't make that guarantee for a single execution of the transaction; it may take a series of aborts and retries, during which some extreme coping mechanisms may need to be brought into play.
Bad implementations are a different issue.
*Where success is possible - taking a protection fault in the middle of the transaction will obviously prevent it from completing (but that's going to take out the process as well).
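To make "a series of aborts and retries" concrete: a sketch of the sort of retry-plus-fallback discipline meant here, using Intel's RTM intrinsics as a stand-in for any HTM (the names, retry limit, and test-and-set fallback lock are all made up):

    #include <immintrin.h>      /* _xbegin/_xend/_xabort; build with -mrtm */
    #include <stdatomic.h>
    #include <stdbool.h>

    #define MAX_TRIES 8

    /* Crude test-and-set fallback lock, kept simple so the transaction can
     * "subscribe" to it by reading the flag. */
    static atomic_bool fallback_busy;

    void atomic_update(long *a, long *b)
    {
        for (int i = 0; i < MAX_TRIES; i++) {
            if (_xbegin() == _XBEGIN_STARTED) {
                /* If anyone is in (or later enters) the fallback path, this
                 * transaction conflicts on the flag and aborts. */
                if (atomic_load_explicit(&fallback_busy, memory_order_relaxed))
                    _xabort(0xff);
                *a += 1;            /* both cachelines join the write set ... */
                *b += 1;
                _xend();            /* ... and become visible atomically here */
                return;
            }
            /* Aborted: conflict, capacity, interrupt, whatever.  Retry. */
        }

        /* The "extreme coping mechanism": give up on the transaction and
         * serialize through the fallback lock. */
        while (atomic_exchange_explicit(&fallback_busy, true, memory_order_acquire))
            ;   /* spin */
        *a += 1;
        *b += 1;
        atomic_store_explicit(&fallback_busy, false, memory_order_release);
    }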