By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), December 4, 2014 1:06 pm
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on December 4, 2014 10:15 am wrote:
> Paul A. Clayton (paaronclayton.delete@this.gmail.com) on December 4, 2014 9:11 am wrote:
[snip]
> There is the forward progress section. Although that is for system-wide forward
> progress.
(I did not notice that next section. Thank you for bringing it to my attention.)
Making guarantees about system-wide forward progress is also an optional, implementation-specific feature. (A system can also have unadvertised implementation-specific features. Using undocumented behavior as statistical information for performance tuning might be reasonable, but it can be dangerous to assume one understands the implementation so completely that one can treat a behavior as guaranteed, even if it seems the only reasonable explanation of the observed behavior. Even if one determined that a given chip is implemented such that the behavior is guaranteed, defects, even within the same stepping, might cause deviations while still upholding architectural behavior, so the part would still be sellable.)
> What I remember is that POWER CPUs also guarantee individual forward
> progress given particular restrictions on the critical section.
>
> I can't find the reference yet.
It does not surprise me that a more recent POWER implementation would have stronger guarantees, since IBM is a high-end systems vendor. iSeries could broadly exploit such implementation-specific features because it uses an intermediate binary format; for pSeries, AIX and its system libraries (and DB2) could readily exploit such features, and modifying selected open source software might be practical.
[snip]
> Power ISA does not it seems, but it allows particular implementations to. POWER CPUs do.
> Software does actually make some assumptions about it too. Linux does not include "backoff"
> and livelock protection in its primitives (like e.g., compxchg based SPARC does).
Interesting.
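For anyone unfamiliar with the distinction being drawn, a compare-and-swap retry loop with randomized exponential backoff (the kind of livelock mitigation the quoted post says Linux adds on SPARC) might look roughly like the sketch below. This is a minimal illustration, not Linux's code: Python has no hardware CAS, so a lock stands in for the atomic instruction, and the names SimulatedCASCell and atomic_add_with_backoff and the delay constants are hypothetical. Only the retry/backoff policy around the CAS is the point.

```python
import random
import threading
import time


class SimulatedCASCell:
    """A word with compare-and-swap semantics.

    A lock stands in for the hardware atomic (an assumption for
    illustration); the retry policy below is what matters.
    """

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Atomically: if value == expected, store new and report success.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def atomic_add_with_backoff(cell, delta, base_delay=1e-6, max_delay=1e-3):
    """Retry CAS until it succeeds, backing off after each failure.

    Randomized (jittered), exponentially growing delays make it less
    likely that contending threads keep colliding in lockstep (livelock).
    Returns the value observed just before the successful add.
    """
    delay = base_delay
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + delta):
            return old
        time.sleep(random.uniform(0, delay))
        delay = min(delay * 2, max_delay)


if __name__ == "__main__":
    counter = SimulatedCASCell()

    def worker():
        for _ in range(1000):
            atomic_add_with_backoff(counter, 1)

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter.load())  # 4000
```

An LL/SC implementation that guarantees forward progress for suitably restricted critical sections would let software skip the backoff logic, which is presumably why Linux on POWER can omit it.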
[snip]
>> (A perhaps better interface would allow each requester to ask for the response to be put in a
>> mailbox since the result is probably not an immediate data dependency. I suppose such could be
>> kludged in to existing systems by adding an undefined state for blocks of memory, so the
>
> What's wrong with existing non blocking loads or OOOE mechanism or avoiding a stall?
OoO execution has a relatively limited window, and using it for this would be like not allowing later instructions to commit because a prefetch instruction has not yet committed. The mechanism mentioned above would be more like (software) asynchronous I/O, or like stalling the reading thread and spawning a thread to continue with the portion of the task that does not depend on the read; that thread would join the stalled reading thread when its non-dependent work was done, and might check early whether the value is ready. Supporting metadata (such as the undefined state for memory blocks) allows software to use the full data range and avoids the need for hardware and software to agree on how metadata is encoded in the data. (The other thread might still depend on the validity of the pending read. E.g., a higher-level software recovery mechanism might be required if a hardware failure or other event prevents the value from ever becoming available, but such a mechanism would be required anyway for the permanently stalled thread.)
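As a rough software analogue of the mailbox idea: a read is issued, the requester immediately gets back a one-shot slot, continues with independent work, and later either peeks (non-blocking) or waits. Readiness is tracked out of band, so every value, including 0 or -1, remains deliverable, and a timeout stands in for the higher-level recovery path. This is a sketch under stated assumptions: Mailbox and issue_read are hypothetical names, and a helper thread plays the role of the memory system.

```python
import threading


class Mailbox:
    """One-shot result slot with an out-of-band 'ready' flag.

    Because readiness is separate metadata, no in-band sentinel value
    has to be reserved to mean "not yet available".
    """

    def __init__(self):
        self._ready = threading.Event()
        self._value = None

    def deliver(self, value):
        self._value = value
        self._ready.set()  # publish only after the value is stored

    def is_ready(self):
        # Non-blocking check: has the value arrived early?
        return self._ready.is_set()

    def wait(self, timeout=None):
        # Block until delivery; on timeout, a caller would fall back to
        # some higher-level recovery mechanism.
        if not self._ready.wait(timeout):
            raise TimeoutError("value never arrived")
        return self._value


def issue_read(memory, addr):
    """Hypothetical asynchronous read: returns a mailbox immediately."""
    box = Mailbox()
    threading.Thread(target=lambda: box.deliver(memory[addr])).start()
    return box


if __name__ == "__main__":
    memory = {0x40: 0}             # 0 is a legitimate value, not "empty"
    box = issue_read(memory, 0x40)
    independent = sum(range(10))   # work with no dependence on the read
    print(independent, box.wait(timeout=1.0))  # 45 0
```

Unlike an OoO window, nothing here forces later independent work to wait for the read to retire; the dependency is only enforced at the explicit wait.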
I don't know of any ISA that provides, e.g., a fire-and-forget atomic increment (which might be useful for performance counters). (Incidentally, there may also be cases where a reader does not especially care about the temporal precision of a value, i.e., stale data may be good enough for some uses.)
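Lacking such an instruction, software can approximate the same properties for counters: increments that return nothing and never read the shared total, plus a reader that tolerates a slightly stale sum. The sketch below uses per-thread slots (a sharded counter); it is a software stand-in, not a description of any hardware posted atomic, and PostedCounter is a hypothetical name.

```python
import threading


class PostedCounter:
    """Counter whose increments return nothing and never read the total.

    Each thread adds into its own slot, so an increment needs no
    read-modify-write on shared state the way a fetch-and-add does.
    Readers sum the slots and may see a slightly stale total.
    """

    def __init__(self):
        self._slots = {}
        self._lock = threading.Lock()  # guards slot creation and snapshots

    def _slot(self):
        tid = threading.get_ident()
        slot = self._slots.get(tid)
        if slot is None:
            with self._lock:
                slot = self._slots.setdefault(tid, [0])
        return slot

    def increment(self, n=1):
        # Fire and forget: only this thread writes its slot, and no
        # value is returned to the caller.
        self._slot()[0] += n

    def approximate_total(self):
        # Not synchronized with in-flight increments; a concurrent read
        # may miss some, which is acceptable for statistics.
        with self._lock:
            slots = list(self._slots.values())
        return sum(slot[0] for slot in slots)
```

Once all writers have finished, approximate_total() is exact; while they run, it is merely close, which matches the "stale data may be good enough" case.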
> The complexity would be in the cache coherency protocol and how to send it to the core. The
> difficulty I guess would be in providing it to that register only, as a one-shot deal.
The larger address space of memory would also seem to facilitate scalability. (The MIPS MultiThreading Application Specific Extension defined Inter-Thread Communication (ITC) Storage, which seemed a bit strange to me. Each 64-bit cell could be addressed through one of sixteen views, selected by bits 6 through 3 of the address, with each view providing different atomic functionality.)
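The view-selection scheme amounts to simple address decoding. A minimal sketch, assuming 64-bit cells, the view index in bits 6..3 as described, bits 2..0 as the byte within the doubleword, and (an assumption for illustration) the cell index in the bits above bit 6; the actual ITC base address, cell spacing, and meaning of each view number are defined by the MIPS MT ASE specification and the implementation, not by this code. decode_itc_offset and encode_itc_offset are hypothetical helpers.

```python
VIEW_SHIFT = 3   # view index starts at address bit 3
VIEW_MASK = 0xF  # four bits (6..3) -> sixteen views
CELL_SHIFT = 7   # assumed: bits above 6 select the cell


def decode_itc_offset(offset):
    """Split an offset within a hypothetical ITC region into (cell, view, byte)."""
    byte = offset & 0x7                          # byte within the 64-bit cell
    view = (offset >> VIEW_SHIFT) & VIEW_MASK    # which of the sixteen views
    cell = offset >> CELL_SHIFT                  # which cell (assumed layout)
    return cell, view, byte


def encode_itc_offset(cell, view):
    """Inverse of decode_itc_offset for an aligned 64-bit access."""
    if not 0 <= view <= VIEW_MASK:
        raise ValueError("view must be 0..15")
    return (cell << CELL_SHIFT) | (view << VIEW_SHIFT)


print(hex(encode_itc_offset(2, 5)))  # 0x128
print(decode_itc_offset(0x128))      # (2, 5, 0)
```

Under this layout each cell consumes a 128-byte stretch of address space (16 views times 8 bytes), which illustrates how the scheme spends address space to encode the operation.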