By: Megol (golem960.delete@this.gmail.com), April 16, 2017 3:05 pm
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on April 16, 2017 4:08 am wrote:
> Regarding "stall everything if any static schedule guess turns out wrong", this is equally true for
> any standard out-of-order superscalar CPU. They also are completely stalled due to cache misses for
> large parts of many real programs (because of the limited number of later independent instructions
> that can be executed before having the load result).
This seems like a fundamental misunderstanding of OoO advantages. No, a statically scheduled VLIW processor isn't comparable to a dynamically scheduled one. The VLIW will stall more often, because the OoO processor can execute speculatively in more cases than the VLIW can.
> The "Mill" load instruction encodings allow the
> hiding of any unpredictable cache misses at least as well as in an out-of-order CPU. In fact in "Mill"
> the distance between the initiation of the load and the use of the result can be larger than the instruction
> window of any existing out-of-order CPU, so it can hide even higher latencies.
No. Simple example:
a <- *x
b <- *y
c <- *a
d <- *b
e <- c + d
a and/or b may miss the cache, requiring c and/or d to be delayed until the load value(s) return. For a statically scheduled design there are four cases to handle (a misses but b hits, a hits but b misses, both miss, neither misses). Even if the compiler handled all of them optimally, it would still be less efficient than the OoO design, which simply issues the loads as early as possible and executes whatever becomes ready first.