By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), April 29, 2017 6:41 pm
Room: Moderated Discussions
Henry S (hsankey.delete@this.example.com) on April 27, 2017 7:29 pm wrote:
> Jacob Marley (jmarley123.delete@this.hotmail.com) on April 27, 2017 12:02 am wrote:
[snip]
>> Are you sure a function call would be a stall barrier on the Mill?
>
> If the value fetched is a parameter to the function call, it very definitely is!
>
> Remember, the Mill doesn't assign register names when an operation is issued; it assigns
> belt positions when an operation completes. Although the Mill supports non-faulting
> reads with a "NaR" bit on the values in the belt, the belt consists only of finished
> and final values. There is no interlocking waiting for values to become ready.
>
> Given this, in order to specify the parameters to pass to a function, their
> computation/fetch must have completed before the call instruction.
>
> So if the called function issues a long-latency load before using the passed-in value, the function
> call forces the two to be serialized, even though there's no true dependency requiring it.
In theory, a statically scheduled ISA using link-time code generation could schedule the load as soon as the address is available. As I understand it, the Mill specifically chose not to support this, since all deferred loads are scoped exclusively to the current function.
Even without knowing exactly when a value will be used, something like the Mill's pick-up load (which names the load and defers making the value visible until a separate pick-up operation requests it) could be used to hoist loads whose addresses are known. (Again, as I understand it, the Mill does not support this.)
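For concreteness, a minimal C sketch of that split-load idea, with a purely hypothetical tag-based interface (load_issue/load_pickup are my names, not Mill operations); in real hardware the fetch would overlap the intervening work rather than happening inside the pick-up call:

#include <stdint.h>
#include <stdio.h>

typedef struct { const int64_t *addr; int64_t value; int ready; } load_tag;

/* "Issue" half: name the load; in hardware the fetch would proceed in the
   background while unrelated work executes. */
static load_tag load_issue(const int64_t *addr) {
    load_tag t = { addr, 0, 0 };
    return t;
}

/* "Pick-up" half: the point at which the value must become visible. */
static int64_t load_pickup(load_tag *t) {
    if (!t->ready) { t->value = *t->addr; t->ready = 1; }
    return t->value;
}

int main(void) {
    int64_t table[4] = { 10, 20, 30, 40 };
    load_tag t = load_issue(&table[2]);   /* hoisted: address known early */
    /* ... independent work would go here, hiding the load latency ... */
    printf("%lld\n", (long long)load_pickup(&t));   /* value needed here */
    return 0;
}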
If one does not mind code bloat, in theory one could use the cache state (or a latency prediction) of a prefetch/load to choose a control-flow path; aside from the bloat, this is also not something existing compilers support. (Information about prefetches and loads could likewise be useful for software switch-on-event multithreading. Alternatively, and perhaps better, an interface could be provided to register threadlets and let hardware schedule their execution, though one might then want an interface to communicate information about each threadlet [I favor software providing such information, though it is not always beneficial and dynamic discovery has some advantages].) Even with indirection, the latency of a load that depends on a previous load that missed could be predicted, allowing software to select a schedule.
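A hedged sketch of what such latency-directed control flow might look like in C, assuming a hypothetical likely_cached() probe (stubbed here, since current ISAs generally do not expose cache state to software) and GCC/Clang's __builtin_prefetch:

#include <stdio.h>

/* Hypothetical probe: stubbed to always predict a miss. */
static int likely_cached(const void *addr) { (void)addr; return 0; }

static void do_independent_work(void) { /* work that does not need *p */ }

static void use_value(long v) { printf("%ld\n", v); }

/* Choose a schedule (control-flow path) based on the predicted load latency. */
static void consume(const long *p) {
    if (likely_cached(p)) {
        use_value(*p);              /* hit path: the value is cheap, use it now */
        do_independent_work();
    } else {
        __builtin_prefetch(p);      /* miss path: start the fetch early ...     */
        do_independent_work();      /* ... overlap independent work ...         */
        use_value(*p);              /* ... then consume the value               */
    }
}

int main(void) {
    long x = 42;
    consume(&x);
    return 0;
}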
(I would not be surprised if a significant fraction of cases needed only a few paths to gain most of the benefit; cache misses probably correlate strongly in many cases, and not just because of execution-phase changes and simple reference patterns. Given that off-chip/package bandwidth is a significant constraint, a system might also bias cache retention not toward approximate latency (i.e., measuring cost merely by miss count or even miss-dependence count) but toward accounting for contention for limited memory bandwidth.)
An out-of-order processor could perform similar optimizations by using load latency prediction and sending operations dependent on a high-latency load to a deferred scheduler (effectively making the independent operations a separate threadlet whose operations enter the more capacity-constrained primary scheduler). Such two-level schedulers have been proposed academically.
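A toy C model of that steering decision, under made-up structure sizes and an assumed per-operation miss prediction; it only shows where operations are parked at dispatch, not how deferred operations later wake up:

#include <stdio.h>

enum { PRIMARY_SLOTS = 4, DEFERRED_SLOTS = 32 };

/* A micro-op tagged with a (made-up) prediction of whether it depends on a
   load expected to miss. */
typedef struct { int id; int depends_on_predicted_miss; } uop;

typedef struct {
    uop primary[PRIMARY_SLOTS];   int n_primary;   /* small, fast scheduler   */
    uop deferred[DEFERRED_SLOTS]; int n_deferred;  /* large, slow side buffer */
} scheduler;

/* Steer each operation by the latency prediction of the load it depends on. */
static int dispatch(scheduler *s, uop u) {
    if (u.depends_on_predicted_miss) {
        if (s->n_deferred == DEFERRED_SLOTS) return 0;   /* dispatch stalls */
        s->deferred[s->n_deferred++] = u;
    } else {
        if (s->n_primary == PRIMARY_SLOTS) return 0;     /* dispatch stalls */
        s->primary[s->n_primary++] = u;
    }
    return 1;
}

int main(void) {
    scheduler s = { .n_primary = 0, .n_deferred = 0 };
    for (int i = 0; i < 8; i++) {
        uop u = { .id = i, .depends_on_predicted_miss = (i % 2) };
        if (!dispatch(&s, u))
            printf("uop %d stalled at dispatch\n", i);
    }
    printf("primary: %d  deferred: %d\n", s.n_primary, s.n_deferred);
    return 0;
}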
One complication for design choices is that even when one level of a system (e.g., compile-time, run-time, localized run-time) has sufficient information to allocate resources statically at that level, the overhead of communicating the known-good allocation may be greater than the cost of more dynamic allocation (especially if substantial dynamic resources are provided anyway to handle the cases where earlier static allocation was not practical or simply not done). The choices are further complicated by variation in the degree and usefulness of knowledge at different levels. One cannot simply say "never put off until [run-time, decode-time, rename-time, execution-time, etc.] what can be done earlier", even though hoisting (and caching) work should be a consideration.
One assumption for "general purpose computing" is that software has no knowledge of the resources available or of contention for resources with other software. I.e., significant abstraction is used. Abstractions such as virtual memory can be useful, but assuming they are essential for all meaningful computing is problematic. Even "general purpose computing" may sometimes benefit from flattening or piercing abstraction.
(Ending before I get even more ranty, tangential, and incoherent.)