By: --- (---.delete@this.redheron.com), June 14, 2022 1:40 pm
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on June 14, 2022 11:48 am wrote:
> Currently, support
> for lower-cost sharing seems to be limited to L1 sharing for multithreaded cores and L2 or L3 sharing within
> a cluster of cores, though network hop distance is also a factor. [I think x86 has support for explicitly writing
> back dirty cache lines to shared cache, but this seems a crude mechanism.])
This is getting into the weeds, but are you sure?
My understanding is that
(a) Intel (and IBM) use MESIF (IBM calls it MERSI), which allows cache-to-cache transfers of unmodified lines, but not of modified lines
(b) Apple uses MOESI, which does allow cache-to-cache transfer of modified lines (updated in the new 2020 protocol, probably not yet implemented in M1, to be much more aggressive in terms of sideways transfers). I think Apple as of M1 allows cache-to-cache transfers within a cluster, but between clusters it's something of a hack: not the full cost of going all the way to DRAM and back, but not as cheap as it will be with the new protocol.
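To make that difference concrete, here is a toy sketch (in C) of how the two protocol families service a remote read of a line held Modified. The states and transitions are pared down to the single difference that matters here; this is not any vendor's actual implementation:

#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED, OWNED, FORWARD } state_t;

/* MESIF-style, as described above: a Modified line is not forwarded
   cache-to-cache; the dirty data is written back first, after which the
   line is clean and one cache holds it in Forward to answer requests. */
static state_t mesif_remote_read(state_t holder, int *writebacks) {
    if (holder == MODIFIED)
        (*writebacks)++;            /* dirty data pushed down before sharing */
    return SHARED;                  /* the requester takes FORWARD */
}

/* MOESI-style: the dirty line is sent sideways, cache-to-cache; the old
   holder keeps the only dirty copy in Owned and no writeback happens yet. */
static state_t moesi_remote_read(state_t holder, int *writebacks) {
    (void)writebacks;               /* nothing written back on this path */
    return (holder == MODIFIED || holder == OWNED) ? OWNED : SHARED;
}

int main(void) {
    int wb = 0;
    state_t s = mesif_remote_read(MODIFIED, &wb);
    printf("MESIF: holder -> %d, writebacks %d\n", (int)s, wb);
    s = moesi_remote_read(MODIFIED, &wb);
    printf("MOESI: holder -> %d, writebacks %d\n", (int)s, wb);
    return 0;
}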
> There is also the factor that not all branch mispredictions are equally expensive. Resolution delay is one factor;
> a branch misprediction resolved twelve cycles after prediction will typically be less expensive than one resolved
> fifty cycles after prediction. A processor might also be able to exploit value and instruction reuse after a
> mispredicted branch to reduce work and/or perform more work in parallel for execution of the corrected path.
> (This might be an aspect where software's broader perspective might be helpful.) Caching branch conditions
> has been proposed for loops, but a similar mechanism might be useful for early branch resolution in a retry;
> even merely predictions might be guided by information from the incorrectly speculated path (a similar method
> has been proposed for transactional memory retries: recording participants and prefetching them).
Apple
- has a predictor as to whether atomic operations (like CAS or store conditional) will succeed.
- this is used in more than one way, but of relevance right now is that it is used to throttle the front end when appropriate, the idea being that there is little point adding to the state that's already in the machine, since that state will probably have to be flushed if the branch prediction machinery predicted, e.g., exiting a CAS loop whereas the atomic predictor believes you will retry the loop. The atomic predictor also trains the branch predictor, but that's after the fact.
- also relevant to your "recording participants" suggestion: if no counter history is available, the atomic predictor bases its prediction on things like whether the number to be stored in a "test and set" location is 0 or 1 (assume a semaphore counter, and assume a low value means low contention), which I thought was a cute hack that probably works surprisingly well!
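For concreteness, this is the shape of loop being predicted; the function and its back-off policy are my illustration of the idea, not Apple's code:

#include <stdatomic.h>
#include <stdbool.h>

/* A CAS-based acquire of a semaphore-like counter. The loop-exit branch at
   the bottom is exactly what the atomic predictor covers: storing 1 over 0
   looks like an uncontended lock, so "CAS succeeds, loop exits" is a
   reasonable cold prediction; repeated failures look like contention, so
   the front end can be throttled instead of speculating past the exit. */
static bool try_acquire(atomic_int *sem) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(sem, &expected, 1)) {
        if (expected != 0)   /* someone else holds it: the contended path */
            return false;    /* back off rather than spin */
        /* spurious failure: expected is still 0, just retry the CAS */
    }
    return true;
}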
> > As regards ISA, the obvious question, then, is what can ISA do to improve this situation?
>
> [snip compiler use of predication and hardware dynamic hammock predication]
>
> Communicating information about branch independence/dependence and perhaps correlation might be useful.
>
> > - a dark horse is ARM's new (v8.8) Branch Consistent (BC) instruction. Does anyone know
> > the backstory behind this (like is it from within ARM or an Apple suggestion or ???)
> > It's somewhat vaguely described (and much may be in the hands of the implementer) but my simple-minded
> > reading of it is that it's a way to move the easy branches (as always, assuming the compiler
> > can get its act together...) out of the fancy branch machinery to much simpler branch machinery,
> > so that the few really hard branches can have much more tech thrown at them.
>
> In addition to potentially allowing trivially predicted branches to use fewer resources,
> this information might also be useful in managing global history. (Of course, a dynamic
> predictor could also detect that a branch is "consistent" and adjust global history.)
>
> The definition of "consistent" is also not entirely clear to me. Is a branch with long phases of the same direction
> consistent? (Such a branch should not be statically predicted but a one-bit predictor would be accurate and trace-like
> optimizations could have better cost-benefit than for shorter-term variability in direction.)
>
> (One could divide local-history-predictable branches into three extreme transition rate classes: high
> transition rate [predict opposite of last is accurate], low transition rate [phased behavior, these
> are well-predicted by typical predictors], and direction-dependent transition rate [loop-like behavior
> has 100% transition rate for one direction — not-taken for backward loop branches — and typically
> a lower transition rate for the other direction, typical predictors 'ignore' the high transition rate
> direction]. Even awareness of the value of local history may be useful. [The alternative to saturating
> counter 2-bit predictors has the advantage of providing the previous direction as the hysteresis bit.
> This could be fed into a global predictor. Even the previous direction of spatially nearby branches
> might be more useful information than ordinary global history; if the "per-address" predictor could
> group temporally local branches, the information value density might be increased.])
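(To pin down that last variant with a minimal sketch: a 2-bit entry whose hysteresis bit is simply the branch's previous direction, so it doubles as exportable history. The encoding here is my own, not from any paper or product.)

/* Each entry keeps {prediction bit, last actual direction}. The prediction
   flips only after two misses in a row, the same hysteresis a saturating
   counter provides, but the second bit always records the true previous
   direction, so it can be fed to a global predictor as history. */
typedef struct { unsigned pred : 1, last : 1; } hyst_entry;

static int hyst_predict(const hyst_entry *e) { return (int)e->pred; }

static void hyst_update(hyst_entry *e, int taken /* 0 or 1 */) {
    if (taken != (int)e->pred && e->last != e->pred)
        e->pred = (unsigned)taken;  /* second consecutive miss: flip */
    e->last = (unsigned)taken;      /* always the real previous direction */
}

static int hyst_history_bit(const hyst_entry *e) { return (int)e->last; }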
Yes, I don't know where ARM are trying to go with BC.
I think it's becoming clear that I-prefetching is running out of juice, in that simply prefetching instructions in advance is of limited value if you are then subject to a stream of branch mispredictions in the newly prefetched code.
One solution is a simply massive L2 branch predictor (the IBM solution), but that's not a great solution, and it doesn't solve the context-switch problem.
An alternative could be some sort of stored predictor state somewhere (assume for now that's a solvable problem), in which case being able to store something simple (e.g. 1-bit predictors that cover 80% of the common cases, as sketched below) might have some value, and maybe that's somehow part of the long-term idea?
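A minimal sketch of what such stored state could look like; the table size, indexing, and save path are pure assumption on my part:

#include <stdint.h>
#include <string.h>

#define NBRANCHES 4096   /* one prediction bit per tracked branch */

typedef struct { uint8_t bits[NBRANCHES / 8]; } onebit_pred;

static unsigned idx(uint64_t pc) { return (unsigned)((pc >> 2) % NBRANCHES); }

static int onebit_predict(const onebit_pred *p, uint64_t pc) {  /* 1 = taken */
    unsigned i = idx(pc);
    return (p->bits[i / 8] >> (i % 8)) & 1;
}

static void onebit_update(onebit_pred *p, uint64_t pc, int taken) {
    unsigned i = idx(pc);
    if (taken)
        p->bits[i / 8] |= (uint8_t)(1u << (i % 8));
    else
        p->bits[i / 8] &= (uint8_t)~(1u << (i % 8));
}

/* The point: the whole table is 512 bytes, so saving and restoring it with
   a process at context-switch time is plausible in a way it never is for a
   large multi-table TAGE-style structure. */
static void onebit_save(const onebit_pred *p, void *dst) {
    memcpy(dst, p->bits, sizeof p->bits);
}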