By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), April 12, 2017 7:02 pm
Room: Moderated Discussions
Megol (golem960.delete@this.gmail.com) on April 12, 2017 3:13 pm wrote:
[snip]
> The delayed load I mentioned (quoted below) takes a time at which it must be ready and inserted onto the belt;
> that load will be stored in a snooping load buffer so that the value fetched is always the (locally visible)
> current one. If there's no load buffer free then the pipeline must stall,
It is impossible for a load buffer not to be available, since the specializer knows how many entries are available and how many loads are pending within a function. (Load buffer addresses and other metadata are spilled on function calls. One interesting aspect of the Mill is that a load could be hoisted many cycles before its use, even when a function is called after the load issues but before it completes. If there were a way for software to detect a high-latency cache miss, it could in theory implement a crude form of switch-on-event multithreading.)
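As a rough sketch of the static accounting this implies (purely illustrative; the buffer count, the function name, and the event model here are invented, not the actual specializer), the specializer can treat each deferred load as occupying a buffer from its issue cycle to its stated retire cycle and check that the overlap never exceeds the hardware's buffer count:

    LOAD_BUFFERS = 8          # illustrative per-member hardware parameter

    def fits_in_buffers(loads):
        """loads: list of (issue_cycle, retire_cycle) pairs for one function."""
        events = []
        for issue, retire in loads:
            events.append((issue, +1))    # buffer allocated when the load issues
            events.append((retire, -1))   # buffer freed when the value retires to the belt
        in_flight = 0
        for _, delta in sorted(events):   # frees sort before allocations in the same cycle
            in_flight += delta
            if in_flight > LOAD_BUFFERS:
                return False              # specializer must reschedule at compile time
        return True

If the check fails, the specializer reschedules or splits the loads before emitting member-specific code, which is why the run-time "no free buffer" stall never arises.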
[snip]
> The Mill is different to ordinary designs in several ways. One important part is that the compiler
> doesn't generate directly executable binary code; it generates a "virtual" binary that is translated
> to a model-specific binary encoding by a second stage. It uses the belt abstraction, which is different
> than normal registers and requires a whole different set of optimizations to avoid spill/fill to
> ? and rescue operations (duplicating data that is about to fall off the belt). There are more quirks
> that make it a hard target to even generate basic non-optimized code.
Spill and fill would go to the scratchpad, which (in this context) is effectively a stack frame that can hold values and their metadata. (The scratchpad can also act as rotating registers for software-pipelined loop bodies too large to fit on the belt. At least that was my impression.)
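For intuition only (the belt length, method names, and dict-based scratchpad are invented for the example, not Mill semantics), the belt can be modeled as a fixed-length queue where every new result pushes the oldest value off the far end, and anything still live at that point must be spilled to the scratchpad and filled back later:

    from collections import deque

    BELT_LENGTH = 8                 # illustrative; real members vary

    class Belt:
        """Toy model of belt drop/spill/fill."""
        def __init__(self):
            self.slots = deque(maxlen=BELT_LENGTH)   # index 0 = most recent drop
            self.scratchpad = {}                     # spilled values, keyed by static offset

        def drop(self, value):
            # Every result drops at the front; once the belt is full the
            # oldest value silently falls off the far end.
            self.slots.appendleft(value)

        def spill(self, position, offset):
            # Copy a still-live value to the scratchpad before it falls off.
            self.scratchpad[offset] = self.slots[position]

        def fill(self, offset):
            # Bring a spilled value back by dropping it onto the belt again.
            self.drop(self.scratchpad[offset])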
The use of metadata to reduce code size (and support not-a-value) is also interesting.
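A toy illustration of that point (the encoding below is invented, not the Mill's): if each operand carries a width tag and a not-a-result flag as metadata, a single generic add opcode serves every operand width, and a faulting speculative load simply produces a NaR that propagates until something non-speculative consumes it:

    from dataclasses import dataclass

    @dataclass
    class Operand:
        value: int
        width: int          # element width in bytes, carried as metadata
        nar: bool = False   # "not a result": poisoned by a speculated fault

    def add(a, b):
        # One opcode: the width comes from the operands' metadata rather than
        # from the instruction encoding, and NaR simply propagates.
        if a.nar or b.nar:
            return Operand(0, a.width, nar=True)
        mask = (1 << (8 * a.width)) - 1
        return Operand((a.value + b.value) & mask, a.width)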