By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), June 14, 2022 11:48 am
Room: Moderated Discussions
--- (---.delete@this.redheron.com) on June 5, 2022 6:12 pm wrote:
> ⚛ (0xe2.0x9a.0x9b.delete@this.gmail.com) on June 4, 2022 11:39 am wrote:
> > Peter Lewis (peter.delete@this.notyahoo.com) on June 1, 2022 3:55 pm wrote:
>
> > It is possible that in the near future it will become clear that CPU performance is determined
> > by the number of [conditional] branches successfully predicted per cycle - and not by whether
> > the CPU's compiler/user interface uses a fixed-length or a variable-length encoding.
>
> This is a great point, a rare example of skating to where the
> puck will be rather than where it was thirty years ago.
While I would agree that variable- versus fixed-length encoding is not as important as correctly predicting branches (with the implication of greater memory level parallelism for both instructions and data), branch prediction accuracy is less important if "microtask" parallelism is abundant (i.e., accurate branch prediction discovers work by looking ahead in a single instruction flow, but independence is still evaluated per instruction/operation). Even hardware-supported speculative multithreading could benefit from information available to the compiler. (Using execution and caching resources with higher communication latency from each other has costs and benefits which depend on the scale of the independence, the nature of the communication (stream, message passing, "random access", etc.), and other factors. Thermal density constraints discourage proximity, but local communication can reduce static/idle energy (low latency → low time to completion) and active energy. Currently, support for lower-cost sharing seems to be limited to L1 sharing for multithreaded cores and L2 or L3 sharing within a cluster of cores, though network hop distance is also a factor. [I think x86 has support for explicitly writing back dirty cache lines to shared cache, but this seems a crude mechanism.])
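(A minimal sketch of that last bracketed point, assuming the mechanism I have in mind is CLDEMOTE, exposed by gcc/clang as the _cldemote intrinsic; the hint asks the hardware to move a line toward a more distant, shared cache level and may simply be ignored:)

    #include <stddef.h>
    #include <immintrin.h>   /* _cldemote; build with -mcldemote on gcc/clang */

    /* Producer fills a buffer, then hints that the dirty lines be demoted
       toward a shared cache level so a consumer on another core can load
       them with lower latency.  The hint is advisory only. */
    void produce_and_demote(int *buf, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            buf[i] = (int)(i * 2 + 1);          /* stand-in for real work */
        for (size_t i = 0; i < n; i += 16)      /* one hint per 64-byte line (16 ints) */
            _cldemote(&buf[i]);
    }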
There is also the factor that not all branch mispredictions are equally expensive. Resolution delay is one factor; a branch misprediction resolved twelve cycles after prediction will typically be less expensive than one resolved fifty cycles after prediction. A processor might also be able to exploit value and instruction reuse after a mispredicted branch to reduce work and/or perform more work in parallel when executing the corrected path. (This might be an aspect where software's broader perspective would be helpful.) Caching branch conditions has been proposed for loops, but a similar mechanism might be useful for early branch resolution in a retry; even mere predictions might be guided by information from the incorrectly speculated path (a similar method has been proposed for transactional memory retries: recording participants and prefetching them).
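(A small C illustration of the resolution-delay point: the first function's branch condition is ready almost as soon as the easily prefetched array element arrives, while the second function's branch waits on a pointer-chasing load that may miss in cache, so its mispredictions are discovered much later:)

    #include <stddef.h>

    struct node { struct node *next; int key; };

    /* Early-resolving branch: the condition depends on a streaming array
       load that is usually prefetched, so a misprediction is typically
       discovered within a few cycles. */
    int count_small(const int *a, size_t n, int limit)
    {
        int c = 0;
        for (size_t i = 0; i < n; i++)
            if (a[i] < limit)
                c++;
        return c;
    }

    /* Late-resolving branch: the condition waits on a pointer-chasing load
       that may miss in cache, so a misprediction can sit unresolved for
       tens of cycles while speculative work piles up behind it. */
    int find_key(const struct node *p, int key)
    {
        while (p) {
            if (p->key == key)
                return 1;
            p = p->next;
        }
        return 0;
    }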
> As regards ISA, the obvious question, then, is what can ISA do to improve this situation?
[snip compiler use of predication and hardware dynamic hammock predication]
Communicating information about branch independence/dependence and perhaps correlation might be useful.
> - a dark horse is ARM's new (v8.8) Branch Consistent (BC) instruction. Does anyone know
> the backstory behind this (like is it from within ARM or an Apple suggestion or ???)
> It's somewhat vaguely described (and much may be in the hands of the implementer) but my simple-minded
> reading of it is that it's a way to move the easy branches (as always, assuming the compiler
> can get its act together...) out of the fancy branch machinery to much simpler branch machinery,
> so that the few really hard branches can have much more tech thrown at them.
In addition to potentially allowing trivially predicted branches to use fewer resources, this information might also be useful in managing global history. (Of course, a dynamic predictor could also detect that a branch is "consistent" and adjust global history.)
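(A minimal software sketch of what I mean by managing global history, assuming a hypothetical per-branch "consistent" hint bit delivered by the front end: hinted branches are served by a cheap bimodal table and kept out of the global history register so they do not dilute it, while the remaining branches use an ordinary gshare-style global path. This is only an illustration, not a claim about how any real implementation treats BC.)

    #include <stdbool.h>
    #include <stdint.h>

    #define TBL_BITS 12
    static uint8_t  bimodal[1u << TBL_BITS];   /* 2-bit counters, hinted branches */
    static uint8_t  gshare [1u << TBL_BITS];   /* 2-bit counters, remaining branches */
    static uint32_t ghr;                       /* global history register */

    static bool predict(uint32_t pc, bool consistent)
    {
        uint32_t mask = (1u << TBL_BITS) - 1;
        if (consistent)
            return bimodal[pc & mask] >= 2;    /* no global history used */
        return gshare[(pc ^ ghr) & mask] >= 2; /* history-indexed */
    }

    /* Called after the branch resolves; a real predictor would checkpoint
       ghr at prediction time rather than reread it here. */
    static void update(uint32_t pc, bool consistent, bool taken)
    {
        uint32_t mask = (1u << TBL_BITS) - 1;
        uint8_t *c = consistent ? &bimodal[pc & mask]
                                : &gshare[(pc ^ ghr) & mask];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        if (!consistent)
            ghr = (ghr << 1) | (uint32_t)taken; /* only "hard" branches shape history */
    }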
The definition of "consistent" is also not entirely clear to me. Is a branch with long phases of the same direction consistent? (Such a branch should not be statically predicted but a one-bit predictor would be accurate and trace-like optimizations could have better cost-benefit than for shorter-term variability in direction.)
(One could divide local-history-predictable branches into three extreme transition-rate classes: high transition rate [predicting the opposite of the last direction is accurate], low transition rate [phased behavior; these are well predicted by typical predictors], and direction-dependent transition rate [loop-like behavior has a 100% transition rate for one direction (not-taken for backward loop branches) and typically a lower transition rate for the other direction; typical predictors 'ignore' the high-transition-rate direction]. Even awareness of the value of local history may be useful. [The alternative two-bit scheme (a prediction bit plus the last outcome, rather than a saturating counter) has the advantage of providing the previous direction as the hysteresis bit. This could be fed into a global predictor. Even the previous direction of spatially nearby branches might be more useful information than ordinary global history; if the "per-address" predictor could group temporally local branches, the information value density might be increased.])
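(A minimal sketch of that alternative two-bit scheme in C; the structure and names are mine, for illustration only. The prediction only flips after two consecutive outcomes disagree with it, matching the hysteresis of a saturating counter, but the last direction is stored explicitly and so is available to feed into a global or nearby-branch history:)

    /* Per-branch entry: 'pred' is the current prediction, 'last' records
       the previous outcome and doubles as the hysteresis bit. */
    #include <stdbool.h>

    struct entry { unsigned pred : 1; unsigned last : 1; };

    static bool predict_dir(const struct entry *e) { return e->pred; }

    static void update_dir(struct entry *e, bool taken)
    {
        if (taken != e->pred && taken == e->last)
            e->pred = taken;   /* two consecutive mismatches: change the prediction */
        e->last = taken;       /* previous direction, usable as history input */
    }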
Converting branch trees into indexed jumps (where the potential targets are encoded in the instruction stream/available early and the predictor predicts the index) may be useful. If the frequencies of the different indexes are not known at compile time, the branch tree might often incur multiple mispredictions, whereas a single index would incur at most one. Even if the condition evaluations needed to produce an index involved significant work, (eagerly) generating one index might be more performant than (lazy) tree evaluation.
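(A minimal C sketch of the tree-to-index conversion; the packing of conditions into an index is hypothetical and chosen only for illustration, and a compiler may or may not emit an actual indexed jump for the switch:)

    /* Branch tree: up to two data-dependent mispredictions per call. */
    int classify_tree(int x, int y)
    {
        if (x < 0) {
            if (y < 0) return 0; else return 1;
        } else {
            if (y < 0) return 2; else return 3;
        }
    }

    /* Indexed form: both conditions are evaluated eagerly and packed into a
       single index, so only one prediction (of the index / jump target) is
       needed and only one misprediction is possible. */
    int classify_indexed(int x, int y)
    {
        unsigned idx = ((unsigned)(x < 0) << 1) | (unsigned)(y < 0);
        switch (idx) {
        case 3:  return 0;   /* x < 0,  y < 0  */
        case 2:  return 1;   /* x < 0,  y >= 0 */
        case 1:  return 2;   /* x >= 0, y < 0  */
        default: return 3;   /* x >= 0, y >= 0 */
        }
    }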