By: anon (anon.delete@this.anon.com), August 28, 2014 8:25 pm
Room: Moderated Discussions
Aaron Spink (aaronspink.delete@this.notearthlink.net) on August 26, 2014 11:51 pm wrote:
> anon (anon.delete@this.anon.com) on August 26, 2014 11:02 pm wrote:
> anon (anon.delete@this.anon.com) on August 26, 2014 11:02 pm wrote:
> > Now that doesn't necessarily say anything about forward progress of a particular CPU, only
> > general forward progress. So a single CPU may get stuck forever if there is contention from
> > others. However, I did get the idea that POWER implementations do provide for some fairness
> > and per-thread forward progress, which would be a consequence of allowing a cacheline to be
> > held exclusive for the specific sequence of instructions between lwarx and stwcx.
> >
> What is likely happening in the actual hardware is that the hardware will temporarily pin the line in
> E/M state long enough for the LL/SC to complete, which can be done if you can guarantee that the LL/SC
> will complete in limited finite time. What's likely happening beyond that is forward state chaining by
> the coherency engines. E.g., 3 requests come in from Pa, Pb, Pc for line Lx. Pa is first and the coherency
> engine gives it the line first, but as soon as it acknowledges the state of the line, the coherency engine
> sends it a FWD(Lx,Pb), and similarly, as soon as Pb acks the line state, it gets a FWD(Lx,Pc).
>
> Pretty much all interconnection-network-based coherent systems have a single coherency master per line (but
> generally multiple masters split by address space). And, as far as I'm aware, they all support E/M state
> FWDing and chaining, as it's a pretty common case. Now, old-school master-less bus-based systems or
> nack-based interconnection network coherence systems can run into starvation issues and require extra
> steps to break out of it. For nack/bus-based systems, this is because instead of queuing (either in a
> dedicated ordered queue or on the network), the system basically throws a retry for the requests it
> doesn't accept. Though I'm pretty sure there aren't any modern coherence systems left that are
> nack/retry based, as modern hardware can easily support either in-network or at-CC queuing for all
> requests. EV7 used a combination of in-network and at-CC queuing, and something like QPI uses full
> pre-allocation.
Thank you for this explanation.
> > I'm not sure if I understand you correctly. I think I failed to explain myself properly: the CAS
> > problem I heard of is not due to the instruction itself failing to make forward progress, or the
> > entire system failing to make forward progress, but due to the software constructs that use it
> > failing to achieve a livelock-free sequence of instructions on a per-thread basis.
> >
>
> Ah, so you are talking about the Fetch_and_Add case.
Well, that was just a relatively simple example.
*Most* of the use cases for CAS that I have seen involve an initial load, a simple comparison and/or computation, then the CAS -- essentially following the pattern of LL/SC. These all have the same potential starvation issue.
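To make the pattern concrete, here is a minimal sketch in C11 atomics (my own illustration, not from any particular codebase). The window between the initial load and the compare-exchange is exactly where other threads can keep sneaking in:

#include <stdatomic.h>

/* Load/compute/CAS loop for an atomic increment. If another
 * thread modifies *v in the window between the load and the
 * compare-exchange, the CAS fails and we retry -- under heavy
 * contention a given thread can, in principle, retry forever. */
void atomic_inc(atomic_int *v)
{
    int old = atomic_load(v);            /* initial load */
    while (!atomic_compare_exchange_weak(v, &old, old + 1)) {
        /* on failure, 'old' is refreshed with the current
         * value, so we just recompute and try again */
    }
}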
Now it is possible to use something like a "blind" CAS. For example, when acquiring a fine-grained lock or refcount, you might be able to assume that it is in the uncontended state and be correct often enough that the optimization is worthwhile. But I have seen very limited application of that.
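As a hedged sketch of what I mean by a "blind" CAS (illustrative names, C11 again): guess that the lock is free and CAS it straight to the held state, without an initial load.

#include <stdatomic.h>
#include <stdbool.h>

/* "Blind" CAS: assume the lock word is uncontended (0) and try
 * to claim it (1) without reading it first. Pays off only when
 * the uncontended guess is right most of the time. */
bool try_lock_blind(atomic_int *lock)
{
    int expected = 0;                    /* guess: currently free */
    return atomic_compare_exchange_strong(lock, &expected, 1);
}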
> Well most CAS based coherence
> systems have FTCH_ADD as a standard primitive so won't ever run into an issue.
x86 does. x86 has a pretty decent range of RMW primitives - add/sub/xchg/cmpxchg/inc/dec (where inc and dec also include the ability to check for zero and such).
But that might take you, spiritually, into CISC territory. Not that there is anything wrong with that for x86, or even prohibitive for a modern RISC, but it's a number of microcoded instructions whose work can be done with LL/SC -- although the dedicated instructions are quite possibly more compact, resulting in fewer macro- and micro-instructions, at least for the common simple cases of inc/dec/etc. That said, I would rather have atomic add/sub + LL/SC than atomic add/sub + CAS.
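For comparison, with a dedicated fetch-and-add primitive the software retry loop disappears entirely; in C11 it's a one-liner (and, as far as I know, it compiles to a single lock xadd on x86):

#include <stdatomic.h>

/* Atomic increment via the dedicated RMW primitive: no retry
 * loop in software, so no per-thread starvation window visible
 * to the programmer. */
int fetch_inc(atomic_int *v)
{
    return atomic_fetch_add(v, 1);       /* returns the old value */
}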
What else?
SPARC has only CAS. z/Architecture I know very little about, but it seems to have only CAS as well.
Are there other architectures which don't have LL/SC? Oh right, Itanium, yes? PA-RISC has only some abomination called "load and clear word" (ldcw)...
>
> > Let's take the simple case of an atomic increment. A CAS architecture will have to do something like:
> >
> > retry:
> > LD r1 := [mem]
> > ADD r2 := r1 + 1
> > CAS [mem], r1, r2
> > CMP failure case
> > BEQ retry
> >
> > Now unless the ISA or implementation also defines very specific sequences leading up to the CAS instruction
> > as being "special" with respect to forward progress of the thread, it's possible that other CPUs
> > will always modify the location between the LD and the CAS, and thus this thread always fails.
> >
> This is basically a bad way to use CAS.
Well, what is a good way to use it?
>
>
> > LL/SC can solve that problem in hardware. A CAS architecture could too, but it would
> > either need some fancy detection/prediction and treatment of LD that comes ahead of
> > the CAS, or it needs an LD variant which is to be used before the CAS, with particular
> > limited instructions between (in which case you're basically back to LL/SC).
> >
> > Not that it's a canonical definition of the ISA or any
> > particular implementation, but Linux's SPARC port has:
> >
> > .globl atomic_add
> > .type atomic_add,#function
> > atomic_add: /* %o0 = increment, %o1 = atomic_ptr */
> > BACKOFF_SETUP(%o2)
> > 1: lduw [%o1], %g1
> > add %g1, %o0, %g7
> > cas [%o1], %g1, %g7
> > cmp %g1, %g7
> > bne,pn %icc, BACKOFF_LABEL(2f, 1b)
> > nop
> > retl
> > nop
> > 2: BACKOFF_SPIN(%o2, %o3, 1b)
> > .size atomic_add, .-atomic_add
> >
> > Where BACKOFF* are defined for SMP kernels as doing some
> > amount of busy waiting before trying the sequence again.
> >
> Yeah, this is basically a case of a detriment in the SPARC
> ISA, where they used CAS but didn't also include FTCH_ADD.