By: anon (anon.delete@this.anon.com), December 3, 2014 7:38 pm
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on December 3, 2014 8:04 pm wrote:
> anon (anon.delete@this.anon.com) on December 3, 2014 5:08 pm wrote:
> > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on December 3, 2014 11:15 am wrote:
> [snip]
> >> I'm personally a fan of RMW instructions due to the guaranteed progress and the whole potential cache
> >> coherency protocol advantage (no need for write intent hints etc). So it makes sense to me.
> >
> > We've talked about this before here, but LL/SC can guarantee progress (when it is limited
> > like POWER does), and the LL of course always carries a load-for-store signal.
>
> IBM's zSeries provides "constrained transactions" which are guaranteed to complete (no need for
> a fallback path) as long as certain conditions are met (including size of the code path).
>
> In theory simple atomic operations using ll/sc could be optimized through idiom recognition, but without an
> architectural guarantee a fallback path must be provided (though a simple always retry immediately mechanism
> would be valid and could work well even with a weaker guarantee than zSeries constrained transactions).
Right. Power ISA provides for an architectural guarantee, although it may not require it of implementations. IBM's POWERx CPUs obviously guarantee it. I can't remember the exact details and I don't have the ISA document handy, but the requirement is indeed a strict limit on the number and types of instructions between the ll and the sc.
>
> [snip code density]
>
> > I wonder if power is another motivation. Theoretically if the instructions don't have side effects that
> > depend on the value, you could export such operations to remote owner of the cacheline in your CC protocol,
> > or to a memory controller if the cacheline is not owned, without blocking the core on the read.
>
> Even with side effects performance can be improved in some cases by performing operations remotely
> from the requester. Cache line ping pong can have a significant performance impact. Again, providing
> some guarantees could be useful (e.g., with certain guarantees about multiple threads "simultaneously"
> incrementing a counter, software could avoid hierarchical counters).
I don't see why that would be. If you have to wait for a reply from the remote end before making progress, it hardly matters whether the reply contains the result of the operation or the data itself, does it?