By: dmcq (dmcq.delete@this.fano.co.uk), June 15, 2022 9:38 am
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on June 14, 2022 11:48 am wrote:
> --- (---.delete@this.redheron.com) on June 5, 2022 6:12 pm wrote:
> > ⚛ (0xe2.0x9a.0x9b.delete@this.gmail.com) on June 4, 2022 11:39 am wrote:
> > > Peter Lewis (peter.delete@this.notyahoo.com) on June 1, 2022 3:55 pm wrote:
> >
> > > It is possible that in the near future it will become clear that CPU performance is determined
> > > by the number of [conditional] branches successfully predicted per cycle - and not by whether
> > > the CPU's compiler/user interface uses a fixed-length or a variable-length encoding.
> >
> > This is a great point, a rare example of skating to where the
> > puck will be rather than where it was thirty years ago.
>
> While I would agree that variable- or fixed-length encoding is not as important as correctly predicting branches
> (with the implication of greater memory level parallelism for both instructions and data), branch prediction
> accuracy is less important if "microtask" parallelism is abundant (i.e., accurate branch prediction discovers
> tasks that can be done by looking ahead in a single instruction flow but their independence is evaluated per
> instruction/operation). Even hardware-supported speculative multithreading could benefit from information available
> to the compiler. (Using execution and caching resources with higher communication latency from each other has
> costs and benefits which depend on the scale of the independence, nature of communication (stream, message passing,
> "random access", etc.), and other factors. Thermal density constraints discourage proximity, but local communication
> can reduce static/idle energy (low latency → low time to completion) and active energy. Currently, support
> for lower-cost sharing seems to be limited to L1 sharing for multithreaded cores and L2 or L3 sharing within
> a cluster of cores, though network hop distance is also a factor. [I think x86 has support for explicitly writing
> back dirty cache lines to shared cache, but this seems a crude mechanism.])
>
> There is also the factor that not all branch mispredictions are equally expensive. Resolution delay is one factor;
> a branch misprediction resolved twelve cycles after prediction will typically be less expensive than one resolved
> fifty cycles after prediction. A processor might also be able to exploit value and instruction reuse after a
> mispredicted branch to reduce work and/or perform more work in parallel for execution of the corrected path.
> (This might be an aspect where software's broader perspective might be helpful.) Caching branch conditions
> has been proposed for loops, but a similar mechanism might be useful for early branch resolution in a retry;
> even mere predictions might be guided by information from the incorrectly speculated path (a similar method
> has been proposed for transactional memory retries: recording participants and prefetching them).
>
> > As regards ISA, the obvious question, then, is what can ISA do to improve this situation?
>
> [snip compiler use of predication and hardware dynamic hammock predication]
>
> Communicating information about branch independence/dependence and perhaps correlation might be useful.
>
> > - a dark horse is ARM's new (v8.8) Branch Consistent (BC) instruction. Does anyone know
> > the backstory behind this (like is it from within ARM or an Apple suggestion or ???)
> > It's somewhat vaguely described (and much may be in the hands of the implementer) but my simple-minded
> > reading of it is that it's a way to move the easy branches (as always, assuming the compiler
> > can get its act together...) out of the fancy branch machinery to much simpler branch machinery,
> > so that the few really hard branches can have much more tech thrown at them.
>
> In addition to potentially allowing trivially predicted branches to use fewer resources,
> this information might also be useful in managing global history. (Of course, a dynamic
> predictor could also detect that a branch is "consistent" and adjust global history.)
>
> The definition of "consistent" is also not entirely clear to me. Is a branch with long phases of the same direction
> consistent? (Such a branch should not be statically predicted but a one-bit predictor would be accurate and trace-like
> optimizations could have better cost-benefit than for shorter-term variability in direction.)
>
> (One could divide local-history-predictable branches into three extreme transition rate classes: high
> transition rate [predict opposite of last is accurate], low transition rate [phased behavior, these
> are well-predicted by typical predictors], and direction-dependent transition rate [loop-like behavior
> has 100% transition rate for one direction — not-taken for backward loop branches — and typically
> a lower transition rate for the other direction, typical predictors 'ignore' the high transition rate
> direction]. Even awareness of the value of local history may be useful. [The alternative to saturating
> counter 2-bit predictors has the advantage of providing the previous direction as the hysteresis bit.
> This could be fed into a global predictor. Even the previous direction of spatially nearby branches
> might be more useful information than ordinary global history; if the "per-address" predictor could
> group temporally local branches, the information value density might be increased.])
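To make those transition-rate classes concrete, here's a toy sketch of my own (not anything from a real predictor): a one-bit "repeat the last outcome" predictor does well on a phased (low transition rate) branch, while an alternating (high transition rate) branch wants the opposite-of-last prediction.

```c
/* Toy sketch (mine, not from the post above) of the first two transition-rate
 * classes. Outcomes: 1 = taken, 0 = not taken. */
static const int phased[8]      = {0, 0, 0, 0, 1, 1, 1, 1}; /* low transition rate  */
static const int alternating[8] = {0, 1, 0, 1, 0, 1, 0, 1}; /* high transition rate */

/* Correct predictions of a one-bit predictor that repeats the last outcome. */
int correct_repeat_last(const int *o, int n) {
    int correct = 0;
    for (int i = 1; i < n; i++)
        if (o[i] == o[i - 1]) correct++;
    return correct;
}

/* Correct predictions of a predictor that flips the last outcome. */
int correct_flip_last(const int *o, int n) {
    int correct = 0;
    for (int i = 1; i < n; i++)
        if (o[i] != o[i - 1]) correct++;
    return correct;
}
```

On the phased stream the repeat-last predictor mispredicts only at the single phase transition; on the alternating stream it mispredicts every time, and flip-last is the one that is always right.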
>
> Converting branch trees into indexed jumps (where the potential targets are encoded in the instruction stream/available
> early and the predictor predicts the index) may be useful. If the frequency of different indexes was not
> known at compile time, the branch tree might often have multiple mispredictions; a single index would have
> one misprediction. Even if the condition evaluations needed for producing an index involved significant work,
> (eagerly) generating one index might be more performant than (lazy) tree evaluation.
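A minimal sketch of that tree-to-index conversion (the handlers and the two-condition shape are my own made-up illustration): the tree form exposes up to two conditional branches to the predictor, while the indexed form evaluates both conditions eagerly and leaves a single indirect jump to predict.

```c
typedef int (*handler_t)(int);

/* Hypothetical leaf handlers, one per condition combination. */
static int h0(int x) { return x;     }
static int h1(int x) { return x + 1; }
static int h2(int x) { return x + 2; }
static int h3(int x) { return x + 3; }

/* Tree form: up to two conditional branches, each a separate prediction. */
int dispatch_tree(int a, int b, int x) {
    if (a) {
        if (b) return h3(x);
        return h2(x);
    }
    if (b) return h1(x);
    return h0(x);
}

/* Indexed form: both conditions are folded eagerly into one index, so the
 * only control transfer the predictor sees is a single indirect jump. */
int dispatch_indexed(int a, int b, int x) {
    static handler_t const table[4] = { h0, h1, h2, h3 };
    unsigned idx = ((unsigned)(a != 0) << 1) | (unsigned)(b != 0);
    return table[idx](x);
}
```

A worst-case walk of the tree can mispredict twice; the indexed form mispredicts at most once (on the target of the indirect jump), at the cost of always evaluating both conditions.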
The ARM Branch Consistent seems the wrong way around to me. I haven't done this for a while, but when I used to use the 'rarely' pragma it ensured that rarely used code, e.g. for handling errors, was put out of line at the end of the code, so the non-rarely code wasn't polluted by stuff that hardly ever ran - good for the cache. That would need a marker saying the branch hardly ever branches, rather than one saying it practically always branches.
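That marker already exists at the compiler level; a minimal sketch using GCC/Clang's __builtin_expect (the function itself is a made-up example of mine):

```c
/* Tell the compiler the error branch is almost never taken. With this hint
 * the compiler can lay the cold block out of line (typically at the end of
 * the function), keeping the hot path dense in the I-cache. */
#define unlikely(x) __builtin_expect(!!(x), 0)

int halve_checked(int value) {
    if (unlikely(value < 0))
        return -1; /* rare error path, moved out of the hot code */
    return value / 2;
}
```

This hints the likely direction for code layout, which is the in-line/out-of-line effect described above; it says nothing about how consistently the branch resolves at run time, which is what BC.cond seems to be about.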