By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), December 4, 2014 9:11 am
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on December 3, 2014 8:38 pm wrote:
> Paul A. Clayton (paaronclayton.delete@this.gmail.com) on December 3, 2014 8:04 pm wrote:
[snip]
>> IBM's zSeries provides "constrained transactions" which are guaranteed to complete (no need for
>> a fallback path) as long as certain conditions are met (including size of the code path).
>>
>> In theory simple atomic operations using ll/sc could be optimized through idiom recognition, but without an
>> architectural guarantee a fallback path must be provided
>> (though a simple always retry immediately mechanism
>> would be valid and could work well even with a weaker guarantee than zSeries constrained transactions).
>
> Right. Power ISA provides for an architectural guarantee, although maybe it does not require
> it of implementations. IBM's POWERx CPUs obviously guarantee it. I can't remember the
> exact details and I don't have the ISA document handy, but indeed the requirement is a
> strict limit on the number and types of instructions between the ll and the sc.
Scanning through version 2.06 (Revision B) of the Power ISA, I did not find any constraints on ll/sc use beyond same address and size. (Alpha, by comparison, defined a minimum 16-byte lock range and left it unpredictable whether other memory accesses or taken branches would clear the lock_flag.) POWER might still have specific constraints beyond those for the standard Power Server Environment.
(Book II, Section 4.4, "Load and Reserve and Store Conditional Instructions", gives surprisingly little information, though it does mention the exclusivity hint, which can be used by hardware to distinguish between using ll/sc to set a lock (exclusive) and using ll/sc to update a shared variable that other threads are also likely to update with less regard for the value.)
Anyway, I was referring more to positive architectural guarantees (e.g., the transaction will always succeed if there is only one fixed-point ALU instruction between the ll and the sc and the instruction sequence is within an aligned 32-byte (or instruction cache block sized) chunk, so fallback code would only be necessary for compatibility with older versions of the ISA).
I suspect POWER does not make even substantial negative guarantees, but is more like Alpha, declaring such cases to be unpredictable (so for a given implementation the circumstances could always clear the reservation, sometimes clear the reservation, or never clear the reservation). An implementation could exploit the fact that portable software will not rely on unpredictable behavior to optimize certain cases more aggressively, but much of that potential already falls out of the general practice of minimizing reservation duration.
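For concreteness, here is a minimal sketch of the "always retry immediately" idiom with a single ALU-style operation between the load and the conditional store. It uses C11 atomics rather than raw ll/sc; on ll/sc architectures compilers typically lower the weak compare-exchange to a load-reserve/store-conditional pair, but that is an assumption about the toolchain, and a compare-exchange is not semantically identical to ll/sc. The helper name atomic_double_plus_one and the particular operation are hypothetical, chosen only for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: atomically apply x = x * 2 + 1 to *p and
 * return the previous value. The weak compare-exchange is allowed
 * to fail spuriously (much as a store-conditional may), so the only
 * "fallback path" is to retry immediately. */
static uint32_t atomic_double_plus_one(_Atomic uint32_t *p)
{
    uint32_t old = atomic_load_explicit(p, memory_order_relaxed);
    uint32_t desired;
    do {
        /* The short computation that would sit between ll and sc. */
        desired = old * 2u + 1u;
        /* On failure, old is refreshed with the current value of *p. */
    } while (!atomic_compare_exchange_weak_explicit(
                 p, &old, desired,
                 memory_order_relaxed, memory_order_relaxed));
    return old;
}
```

Without a positive architectural guarantee, nothing in this loop promises forward progress under contention; it merely tends to work well when the window between load and store is short.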
[snip]
>> Even with side effects performance can be improved in some cases by performing operations remotely
>> from the requester. Cache line ping pong can have a significant performance impact. Again, providing
>> some guarantees could be useful (e.g., with certain guarantees about multiple threads "simultaneously"
>> incrementing a counter, software could avoid hierarchical counters).
>
> I don't see why that would be. If you have to wait for a reply from remote before making progress,
> it hardly matters whether that contains result of the operation or the data itself, does it?
An example: Multiple threads incrementing a value to get a unique identifier would be serialized, each paying a communication delay, in a traditional implementation (in addition to generating more coherence traffic, which could be a performance constraint). With a centralized arbiter, the communication delay is one round trip between core and arbiter rather than multiple serialized trips between different requesters and previous owners.
(Perhaps I am missing something.)
(A perhaps better interface would allow each requester to ask for the response to be put in a mailbox, since the result is probably not an immediate data dependency. I suppose such could be kludged into existing systems by adding an undefined state for blocks of memory, so the requester would undefine a block, ask the arbiter/controller to write the response to that block, and hardware could treat a read of an undefined block something like an MWAIT instruction.)
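To make the unique-identifier example concrete, here is a minimal sketch in C11 of what the software side might look like. The names id_counter and next_id are hypothetical. Expressing the increment as a single fetch-and-add at least gives hardware the option of performing it at a central point (the arbiter case); whether a given implementation does so, or instead migrates the cache line to each requester in turn, is not something the C code can see or control.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared counter handing out unique identifiers. */
static _Atomic uint64_t id_counter = 0;

/* Returns a value that no other call to next_id() receives
 * (until the 64-bit counter wraps). Relaxed ordering is enough
 * here because only uniqueness of the returned value matters,
 * not ordering relative to other memory operations. */
static uint64_t next_id(void)
{
    return atomic_fetch_add_explicit(&id_counter, 1,
                                     memory_order_relaxed);
}
```

Each caller sees one request and one response; in the traditional case the serialized line transfers between previous owners are hidden inside that one operation.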