By: anon (anon.delete@this.anon.com), August 26, 2014 5:06 am
Room: Moderated Discussions
Aaron Spink (aaronspink.delete@this.notearthlink.net) on August 26, 2014 4:41 am wrote:
> Patrick Chase (patrickjchase.delete@this.gmail.com) on August 25, 2014 11:32 pm wrote:
> > Randomized exponential backoff is the easy one that just about everybody uses.
> > Note that exponential backoff alone (without a separate randomized starting
> > delay for each thread) doesn't provably resolve livelock, though.
> >
>
> And unfortunately this can actually have a pretty significant impact on performance as well.
>
> One of the reasons that I'm currently not that big of a fan of LL/SC, is that it is basically used ONLY for
> analogs of CMPXCHG, et al. But even though it is only used as an analog for the basic primitives, its design
> is such that it severely limits the optimization capabilities because it is basically unbounded.
>
> For an ISA definition, I would much rather have the hard primitives like CMPXCHG such that you have the
> potential for much more optimization. For something like CMPXCHG, you can actually export the operation
> into the coherence infrastructure and aren't necessarily required to do it at the core of invocation. This
> can allow for some much more efficient coherency flows in the presence of high contention. Being able to
> handle the interlock for hot lines outside the core can have a significant impact on performance. It's a
> bit of flexibility that will be nice to have as hardware contexts continue to climb ever higher.
>
> Many others have similar thoughts, for instance, while RISC-V has LL/SC, it
> also has separate primitives for things like Fetch_and_ADD with one of the
> main ideas being that you can export the operation outside of the core.
>
> Basically I look at LL/SC these days as a rather poor and broken implementation
> of transactional memory, with many of the downsides and none of the advantages.
I had the idea from somewhere that LL/SC on POWER CPUs comes with similar kinds of hardware guarantees when used in very specific, limited sequences. That is, the hardware can take and hold the line to avoid livelock, avoid state transitions in the meantime, etc. I don't have a reference off the top of my head (or the PowerPC ISA manual handy to check what it says), so I could be wrong.
In fact, on other architectures (e.g., SPARC), I think CAS has been a problem in the past with livelocks, because of the common need to load the source data before the CAS. The advantage of LL/SC there is that the LL can signal the core to load the line exclusive and prepare for the SC, whereas with LD/CAS it may be harder for the hardware to optimize that first load and squash the livelocks.
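
To make the LD/CAS pattern concrete, here is a rough C++ sketch of what I mean by "load the source data before the CAS", next to the fetch-and-add style primitive mentioned in the quote. The function names (`cas_increment`, `faa_increment`), the backoff cap of 1024, and the use of `yield()` as the delay are my own assumptions for illustration, not anything taken from a particular ISA manual or library.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <random>
#include <thread>

// Hypothetical helper: increment via a plain load followed by a CAS retry loop.
// The initial load is an ordinary read, so the line may arrive shared and then
// need an upgrade to exclusive for the CAS. Returns the value seen before the add.
std::uint64_t cas_increment(std::atomic<std::uint64_t>& counter) {
    thread_local std::minstd_rand rng{std::random_device{}()};
    unsigned max_spins = 1;  // upper bound of the random delay; doubles per failure
    std::uint64_t seen = counter.load(std::memory_order_relaxed);  // the "LD"
    while (!counter.compare_exchange_weak(seen, seen + 1,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
        // On failure compare_exchange_weak has refreshed `seen` for us.
        // Randomized exponential backoff: wait a random number of yields
        // in [0, max_spins], then widen the window (capped, arbitrary limit).
        std::uniform_int_distribution<unsigned> pick(0, max_spins);
        for (unsigned i = pick(rng); i > 0; --i) {
            std::this_thread::yield();
        }
        if (max_spins < 1024) {
            max_spins *= 2;
        }
    }
    return seen;
}

// Hypothetical helper: the same update as a single read-modify-write.
// There is no separate load and no retry loop visible to software.
std::uint64_t faa_increment(std::atomic<std::uint64_t>& counter) {
    return counter.fetch_add(1, std::memory_order_acq_rel);
}
```

How either function is actually lowered depends on the target and compiler: on an LL/SC machine the CAS itself typically becomes a small LL/SC loop, so this only shows the software-visible shape of the two approaches, not what the coherence fabric does with them.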