By: Aaron Spink (aaronspink.delete@this.notearthlink.net), August 26, 2014 11:51 pm
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on August 26, 2014 11:02 pm wrote:
> Now that doesn't necessarily say anything about forward progress of a particular CPU, only
> general forward progress. So a single CPU may get stuck forever if there is contention from
> others. However I did get the idea that POWER implementations do provide for some fairness
> and per-thread forward progress. Which would be a consequence of allowing cacheline to be
> held exclusive for specific sequence of instructions between lwarx and stwcx.
>
What is likely happening in the actual hardware is that it will temporarily pin the line in E/M state long enough for the LL/SC to complete, which is safe as long as you can guarantee the LL/SC completes in limited, finite time. What's likely happening beyond that is forward-state chaining by the coherency engines. I.e., three requests come in from Pa, Pb, and Pc for line Lx. Pa is first, and the coherency engine gives it the line first, but as soon as Pa acknowledges the state of the line, the coherency engine sends it a FWD(Lx,Pb); similarly, as soon as Pb acks the line state, it gets a FWD(Lx,Pc).
Pretty much all interconnection-network-based coherent systems have a single coherency master per line (though generally multiple masters split by address space). And, as far as I'm aware, they all support E/M-state FWDing and chaining, since it's a pretty common case. Now, old-school masterless bus-based systems, or a NACK-based interconnection-network coherence system, can run into starvation issues and require extra steps to break out of them. For NACK/bus-based systems, this is because instead of queuing requests (either in a dedicated ordered queue or on the network), the master basically throws a retry at the requests it doesn't accept. Though I'm pretty sure there aren't any modern coherence systems left that are NACK/retry based, as modern hardware can easily support either in-network or at-CC queuing for all requests. EV7 used a combination of in-network and at-CC queuing, and something like QPI uses full pre-allocation.
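
In rough software terms, the chaining a coherency master does looks something like the sketch below. This is just an illustrative model, not any real directory's implementation; the struct and function names are made up, and the real thing is per-line hardware state in the directory/CC.

/* Toy model of E/M forward-chaining at a per-line coherency master.
 * All names here are invented for illustration. */
#include <stdio.h>

#define MAX_WAITERS 8

struct line_master {
    int owner;                  /* core currently holding line Lx in E/M, -1 if none */
    int waiters[MAX_WAITERS];   /* ordered queue of pending exclusive requesters     */
    int head, tail;
};

/* A new exclusive request arrives: grant it if the line is idle,
 * otherwise queue it behind the current owner. */
static void request_exclusive(struct line_master *m, int core)
{
    if (m->owner < 0) {
        m->owner = core;
        printf("GRANT(Lx) -> P%d\n", core);
    } else {
        m->waiters[m->tail++ % MAX_WAITERS] = core;
        printf("QUEUE(Lx): P%d queued (current owner P%d)\n", core, m->owner);
    }
}

/* The current owner acks the line state (its LL/SC window is done):
 * the master immediately sends it a FWD for the next queued requester. */
static void owner_acked(struct line_master *m)
{
    if (m->head != m->tail) {
        int next = m->waiters[m->head++ % MAX_WAITERS];
        printf("FWD(Lx,P%d) sent to P%d\n", next, m->owner);
        m->owner = next;
    } else {
        m->owner = -1;   /* nobody waiting, line goes back to the directory */
    }
}

int main(void)
{
    struct line_master m = { .owner = -1 };
    request_exclusive(&m, 0);   /* Pa gets the line      */
    request_exclusive(&m, 1);   /* Pb queued             */
    request_exclusive(&m, 2);   /* Pc queued             */
    owner_acked(&m);            /* Pa acks -> FWD(Lx,Pb) */
    owner_acked(&m);            /* Pb acks -> FWD(Lx,Pc) */
    return 0;
}

The point of the ordered queue is exactly the starvation argument above: as long as requests are queued rather than NACKed, every requester eventually gets the line.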
> I'm not sure if I understand you correctly. I think I failed to explain myself properly: the CAS
> problem I heard of is not due to the instruction itself failing to make forward progress, or the
> entire system failing to make forward progress, but due to the software constructs that use it,
> failing to be able to achieve a livelock-free sequence of instructions on a per-thread basis.
>
Ah, so you are talking about the Fetch_and_Add case. Well, most CAS-based systems also have FTCH_ADD as a standard primitive, so they won't ever run into this issue.
> Let's take the simple case of an atomic increment. A CAS architecture will have to do something like:
>
> retry:
> LD r1 := [mem]
> ADD r1 := r1 + 1
> CAS [mem], r1
> CMP failure case
> BEQ retry
>
> Now unless the ISA or implementation also defines very specific sequences leading up to the CAS instruction
> as being "special", with respect to forward progress of the thread, then it's possible that other CPUs
> will always modify the location between the LD and the CAS, and thus this thread always fails.
>
This is basically a bad way to use CAS.
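
To make the FTCH_ADD point concrete, here's a minimal C sketch of the same increment done with a native fetch-and-add instead of a LD/CAS retry loop. It uses the GCC/Clang __atomic builtin rather than any particular kernel's code; on ISAs with a real fetch-and-add (x86 LOCK XADD, ARMv8.1 LDADD, etc.) it compiles to a single read-modify-write that can't livelock, because the modification happens while the line is held.

#include <stdint.h>

/* Atomic increment via fetch-and-add: no load/CAS window for
 * other CPUs to race into, so no retry loop is needed. */
static inline uint64_t atomic_inc(uint64_t *mem)
{
    /* returns the value *before* the increment */
    return __atomic_fetch_add(mem, 1, __ATOMIC_SEQ_CST);
}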
> LL/SC can solve that problem in hardware. A CAS architecture could too, but it would
> either need some fancy detection/prediction and treatment of LD that comes ahead of
> the CAS, or it needs an LD variant which is to be used before the CAS, with particular
> limited instructions between (in which case you're basically back to LL/SC).
>
> Not that it's a canonical definition of the ISA or any
> particular implementation, but Linux's SPARC port has:
>
>
> .globl atomic_add
> .type atomic_add,#function
> atomic_add: /* %o0 = increment, %o1 = atomic_ptr */
> BACKOFF_SETUP(%o2)
> 1: lduw [%o1], %g1
> add %g1, %o0, %g7
> cas [%o1], %g1, %g7
> cmp %g1, %g7
> bne,pn %icc, BACKOFF_LABEL(2f, 1b)
> nop
> retl
> nop
> 2: BACKOFF_SPIN(%o2, %o3, 1b)
> .size atomic_add, .-atomic_add
>
>
> Where BACKOFF* are defined for SMP kernels as doing some
> amount of busy waiting before trying the sequence again.
>
Yeah, this is basically a shortcoming of the SPARC ISA: they adopted CAS but didn't also add FTCH_ADD.
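
For what it's worth, the BACKOFF_* macros above boil down to bounded exponential busy-waiting around the CAS retry. In C the same idea looks roughly like this; it's only a sketch of the technique, not the Linux implementation, and the names and the backoff limit are made up.

#include <stdint.h>

#define BACKOFF_LIMIT 4096   /* arbitrary cap for the example */

/* CAS retry loop with exponential backoff, mirroring the structure of
 * the SPARC atomic_add above (load, add, cas, spin-and-retry on failure). */
static void atomic_add_backoff(int increment, int *atomic_ptr)
{
    unsigned backoff = 1;

    for (;;) {
        int old = __atomic_load_n(atomic_ptr, __ATOMIC_RELAXED);
        int new = old + increment;
        if (__atomic_compare_exchange_n(atomic_ptr, &old, new, 0,
                                        __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
            return;
        /* CAS lost the race: busy-wait a little, doubling the wait each
         * time so contending threads spread themselves out. */
        for (unsigned i = 0; i < backoff; i++)
            __asm__ __volatile__("" ::: "memory");   /* compiler barrier only */
        if (backoff < BACKOFF_LIMIT)
            backoff <<= 1;
    }
}

The backoff reduces the odds of repeated collisions, but it's still a probabilistic fix; it doesn't give the per-thread forward-progress guarantee that a native fetch-and-add provides.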