By: anon (anon.delete@this.anon.com), August 26, 2014 11:02 pm
Room: Moderated Discussions
Aaron Spink (aaronspink.delete@this.notearthlink.net) on August 26, 2014 5:27 pm wrote:
> anon (anon.delete@this.anon.com) on August 26, 2014 6:06 am wrote:
> > I had the idea from somewhere that LL/SC in POWER CPUs had similar kinds of hardware guarantees
> > when used in very specific, limited sequences. That is, the hardware can take and hold the line
> > to avoid livelocks, will avoid state transitions, etc. I don't have a reference off the top of
> > my head (or the powerpc ISA manual handy to see what it says), so I could be wrong.
> >
> Basically all LL/SC architectures have put severe limits on the usage of LL/SC
> to get them to actually work in practice. Lots of things like no stores between
> LL/SC, limited number of instruction, limited number of loads, etcx.
Yes, the powerpc ISA manual says that implementations can provide guarantee for forward progress basically if they enumerate and allow software to avoid conditions that lead to reservation loss, but it's implementation specific.
I'm not quite sure where to find the docs for POWER[678etc], but, at risk of polluting the thread with too much unfounded claims, I recall the conditions are very strict. No other loads or stores, very limited number of instructions, branch limitation, etc.
Now that doesn't necessarily say anything about forward progress of a particular CPU, only general forward progress. So a single CPU may get stuck forever if there is contention from others. However I did get the idea that POWER implementations do provide for some fairness and per-thread forward progress. Which would be a consequence of allowing cacheline to be held exclusive for specific sequence of instructions between lwarx and stwcx.
>
>
> > In fact, in other architectures (e.g., SPARC), CAS I think has been a problem in the past with livelocks,
> > because of the common need to load the source data before the CAS. The advantage there of LL/SC
> > is that the LL could signal the core to load-exclusive and prepare for SC, etc. whereas LD/CAS may
> > be more difficult to optimize that first load and squash the livelocks in hardware.
> >
> This is likely the case of the double values load/swap. AKA, compare on X and swap
> Y.
I'm not sure if I understand you correctly. I think I failed to explain myself properly: the CAS problem I heard of is not due to the instruction itself failing to make forward progress, or the entire system failing to make forward progress, but due to the software constructs that use it, failing to be able to achieve a livelock-free sequence of instructions on a per-thread basis.
Let's take the simple case of an atomic increment. A CAS architecture will have to do something like:
retry:
LD r1 := [mem]
ADD r1 := r1 + 1
CAS [mem], r1
CMP failure case
BEQ retry
Now unless the ISA or implementation also defines very specific sequences leading up to the CAS instruction as being "special", with respect to forward progress of the thread, then it's possible that other CPUs will always modify the location between the LD and the CAS, and thus this thread always fails.
LL/SC can solve that problem in hardware. A CAS architecture could too, but it would either need some fancy detection/prediction and treatment of LD that comes ahead of the CAS, or it needs an LD variant which is to be used before the CAS, with particular limited instructions between (in which case you're basically back to LL/SC).
Not that it's a canonical definition of the ISA or any particular implementation, but Linux's SPARC port has:
Where BACKOFF* are defined for SMP kernels as doing some amount of busy waiting before trying the sequence again.
> Most CAS architectures have simply added double load CAS which handles this.
> Load X and Y from same 32 or 64 byte aligned space and swap one of them.
>
> None of the intrinsics handle multiline CAS like operations well, neither CAS nor LL/SC. Any time you
> get into multiple coherency transaction inter-context transactions you run into a lot of issues.
> anon (anon.delete@this.anon.com) on August 26, 2014 6:06 am wrote:
> > I had the idea from somewhere that LL/SC in POWER CPUs had similar kinds of hardware guarantees
> > when used in very specific, limited sequences. That is, the hardware can take and hold the line
> > to avoid livelocks, will avoid state transitions, etc. I don't have a reference off the top of
> > my head (or the powerpc ISA manual handy to see what it says), so I could be wrong.
> >
> Basically all LL/SC architectures have put severe limits on the usage of LL/SC
> to get them to actually work in practice. Lots of things like no stores between
> LL/SC, limited number of instructions, limited number of loads, etc.
Yes, the powerpc ISA manual says that implementations may provide a forward-progress guarantee, essentially by enumerating the conditions that cause reservation loss so that software can avoid them; but the details are implementation specific.
I'm not quite sure where to find the docs for POWER[678etc], but, at the risk of polluting the thread with too many unfounded claims, I recall the conditions are very strict: no other loads or stores, a very limited number of instructions, branch restrictions, etc.
Now that doesn't necessarily say anything about the forward progress of a particular CPU, only about forward progress in general. So a single CPU may get stuck forever if there is contention from others. However, I did get the idea that POWER implementations do provide some fairness and per-thread forward progress, which would be a consequence of allowing the cache line to be held exclusive for the specific sequence of instructions between lwarx and stwcx.
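For illustration, here's a minimal sketch of that lwarx/stwcx. sequence as GCC-style inline assembly (the C wrapper and constraints are mine; only the four-instruction loop itself is the canonical one):

static inline void atomic_inc(int *p)
{
        int tmp;
        __asm__ __volatile__(
"1:     lwarx   %0,0,%2\n"      /* load word and take a reservation on the line */
"       addi    %0,%0,1\n"      /* increment in a register */
"       stwcx.  %0,0,%2\n"      /* store succeeds only if the reservation survived */
"       bne-    1b\n"           /* reservation lost: retry from the lwarx */
                : "=&r" (tmp), "+m" (*p)
                : "r" (p)
                : "cc");
}

The point is that the window the hardware has to protect is exactly those four instructions, which is what makes holding the line exclusive across the sequence feasible at all.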
>
>
> > In fact, in other architectures (e.g., SPARC), CAS I think has been a problem in the past with livelocks,
> > because of the common need to load the source data before the CAS. The advantage there of LL/SC
> > is that the LL could signal the core to load-exclusive and prepare for SC, etc. whereas LD/CAS may
> > be more difficult to optimize that first load and squash the livelocks in hardware.
> >
> This is likely the case of the double values load/swap. AKA, compare on X and swap
> Y.
I'm not sure I understand you correctly; I think I failed to explain myself properly. The CAS problem I heard of is not the instruction itself failing to make forward progress, nor the entire system failing to make forward progress, but the software constructs that use it being unable to achieve a livelock-free sequence of instructions on a per-thread basis.
Let's take the simple case of an atomic increment. A CAS architecture will have to do something like:
retry:
        LD      r1 := [mem]     ; load the current value
        ADD     r2 := r1 + 1    ; compute the new value in another register
        CAS     [mem], r1, r2   ; if [mem] still equals r1, store r2; r2 receives the old value
        CMP     r1, r2          ; equal means the CAS succeeded
        BNE     retry           ; someone else modified [mem] first: try again
Now, unless the ISA or implementation also defines very specific sequences leading up to the CAS instruction as "special" with respect to forward progress of the thread, it's possible that other CPUs always modify the location between the LD and the CAS, and thus this thread always fails.
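The same loop in portable C11, just to make the window explicit (a minimal sketch; atomic_compare_exchange_weak conveniently reloads 'old' with the observed value on failure, but nothing here guarantees this particular thread ever wins):

#include <stdatomic.h>

void atomic_increment(atomic_int *mem)
{
        int old = atomic_load_explicit(mem, memory_order_relaxed);
        /* Livelock window: another CPU can change *mem between the
         * load above and the compare-exchange below, every iteration. */
        while (!atomic_compare_exchange_weak(mem, &old, old + 1))
                ;       /* on failure, 'old' now holds the current value */
}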
LL/SC can solve that problem in hardware. A CAS architecture could too, but it would either need some fancy detection/prediction around the LD that comes ahead of the CAS, or an LD variant to be used before the CAS with only a limited set of instructions permitted in between (in which case you're basically back to LL/SC).
Not that it's a canonical definition of the ISA or any particular implementation, but Linux's SPARC port has:
        .globl  atomic_add
        .type   atomic_add,#function
atomic_add: /* %o0 = increment, %o1 = atomic_ptr */
        BACKOFF_SETUP(%o2)
1:      lduw    [%o1], %g1                      /* load the current value */
        add     %g1, %o0, %g7                   /* compute the new value */
        cas     [%o1], %g1, %g7                 /* if [%o1] == %g1, swap with %g7; %g7 gets the old value */
        cmp     %g1, %g7                        /* equal means the cas succeeded */
        bne,pn  %icc, BACKOFF_LABEL(2f, 1b)     /* failed: back off and retry */
         nop                                    /* branch delay slot */
        retl
         nop
2:      BACKOFF_SPIN(%o2, %o3, 1b)
        .size   atomic_add, .-atomic_add
where the BACKOFF* macros are defined, for SMP kernels, to do some amount of busy waiting before trying the sequence again.
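Roughly speaking, the backoff amounts to an exponential delay (this is my sketch in C, not the kernel's actual macros, which are SPARC asm, and the cap is a made-up number):

#define BACKOFF_LIMIT 4096      /* illustrative cap, not the kernel's value */

static void backoff_spin(unsigned int *count)
{
        for (unsigned int i = 0; i < *count; i++)
                __asm__ __volatile__("" ::: "memory");  /* keep the spin loop alive */
        if (*count < BACKOFF_LIMIT)
                *count *= 2;    /* double the delay after each failed attempt */
}

Capped exponential backoff like this doesn't give a hard forward-progress guarantee either, but it makes the pathological always-collide schedule statistically unlikely.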
> Most CAS architectures have simply added a double-load CAS which handles this.
> Load X and Y from the same 32- or 64-byte aligned space and swap one of them.
>
> None of the intrinsics handle multi-line CAS-like operations well, neither CAS nor LL/SC. Any time you
> get into inter-context operations involving multiple coherency transactions you run into a lot of issues.
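For what it's worth, the plain double-width flavor (compare and swap the whole pair, as opposed to the compare-X-swap-Y variant you describe) can be sketched with GCC's generic atomics; on x86-64 built with -mcx16 this lowers to cmpxchg16b. The struct and function names here are mine:

#include <stdint.h>
#include <stdbool.h>

/* 16-byte aligned pair so the whole thing fits one wide CAS. */
struct pair {
        uint64_t x, y;
} __attribute__((aligned(16)));

/* Double-width CAS: succeeds only if both words still match
 * *expected; on failure, *expected is updated with the current pair. */
static bool dcas(struct pair *p, struct pair *expected, struct pair desired)
{
        return __atomic_compare_exchange(p, expected, &desired,
                                         false,         /* strong */
                                         __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}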