By: anon (anon.delete@this.anon.com), December 4, 2014 10:15 am
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on December 4, 2014 9:11 am wrote:
> anon (anon.delete@this.anon.com) on December 3, 2014 8:38 pm wrote:
> > Paul A. Clayton (paaronclayton.delete@this.gmail.com) on December 3, 2014 8:04 pm wrote:
> [snip]
> >> IBM's zSeries provides "constrained transactions" which are guaranteed to complete (no need for
> >> a fallback path) as long as certain conditions are met (including size of the code path).
> >>
> >> In theory simple atomic operations using ll/sc could
> be optimized through idiom recognition, but without an
> >> architectural guarantee a fallback path must be provided
> >> (though a simple always retry immediately mechanism
> >> would be valid and could work well even with a weaker guarantee than zSeries constrained transactions).
> >
> > Right. Power ISA provides for an architectural guarantee, although maybe it does not require
> > it of implementations. IBM's POWERx CPUs obviously guarantee it. I can't remember the
> > exact details and I don't have the ISA document handy, but indeed the requirement is a
> > strict limit on the number and types of instructions between the ll and the sc.
>
> Scanning through version 2.06 (Revision B) of the Power ISA, I did not find any constraints on
> ll/sc use (beyond same address and size; Alpha defined a minimum 16-byte lock range and it was
> unpredictable whether other memory accesses or taken branches would clear the lock_flag), but POWER
> might have specific constraints beyond those for the standard Power Server Environment.
>
> (Book II, Section 4.4, "Load and Reserve and Store Conditional Instructions", gives surprisingly
> little information, though it does mention the exclusivity hint, which can be used by hardware
> to distinguish between using ll/sc to set a lock (exclusive) and using ll/sc to update a shared
> variable that other threads are also likely to update with less regard for the value.)
There is the forward-progress section, although that covers system-wide forward progress. What I remember is that POWER CPUs also guarantee individual forward progress, given particular restrictions on the critical section.
I can't find the reference yet.
> Anyway, I was referring more to positive architectural guarantees (e.g., the transaction will
> always succeed if there is only one fixed-point ALU instruction between the ll and the sc and
> the instruction sequence is within an aligned 32-byte (or instruction cache block sized) chunk,
> so fallback code would only be necessary for compatibility with older versions of the ISA).
I think "will always succeed" is unnecessarily strict. Having an optimistic first pass allows a far less clever fallback, and unnecessary cleverness is something to be avoided :)
>
> I suspect POWER does not make even substantial negative guarantees, but is more like Alpha, declaring such
> to be unpredictable (so for a given implementation the circumstances could always clear the reservation,
> sometimes clear the reservation, or never clear the reservation). An implementation could exploit the fact
> that portable software will not use unpredictable behavior to more aggressively optimize certain cases,
> but much of that potential falls out from general practice of minimizing reservation duration.
The Power ISA does not, it seems, but it allows particular implementations to, and POWER CPUs do. Software actually makes some assumptions about this too: Linux does not include backoff or livelock protection in its ll/sc primitives (as, e.g., cmpxchg-based SPARC does).
>
> [snip]
>
> >> Even with side effects performance can be improved in some cases by performing operations remotely
> >> from the requester. Cache line ping pong can have a significant performance impact. Again, providing
> >> some guarantees could be useful (e.g., with certain guarantees about multiple threads "simultaneously"
> >> incrementing a counter, software could avoid hierarchical counters).
> >
> > I don't see why that would be. If you have to wait for a reply from remote before making progress,
> > it hardly matters whether that contains result of the operation or the data itself, does it?
>
> An example: Multiple threads incrementing a value to get a unique identifier would be serialized with communication
> delay in a traditional implementation (in addition to having more coherence traffic, which could be a performance
> constraint). With a centralized arbiter the communication delay is one round trip between core and arbiter
> rather than multiple serialized trips between different requesters and previous owners.
>
> (Perhaps I am missing something.)
Oh I see.
>
> (A perhaps better interface would allow each requester to ask for the response to be put in a
> mailbox since the result is probably not an immediate data dependency. I suppose such could be
> kludged into existing systems by adding an undefined state for blocks of memory, so the requester
> would undefine a block, ask the arbiter/controller to write the response to that block, and hardware
> could treat a read of an undefined block something like a MWAIT instruction.)
What's wrong with using existing non-blocking loads or the OOOE mechanism to avoid a stall?
The complexity would be in the cache coherence protocol and in how to deliver the result to the core. The difficulty, I guess, would be in providing it to that register only, as a one-shot deal.