By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), February 19, 2013 5:52 pm
Room: Moderated Discussions
EBFE (x.delete@this.y.com) on February 19, 2013 1:03 am wrote:
[snip]
> By "early branch resolution" you mean that the branch simply doesn't predict and stall for the condition?
No, I was thinking of using early availability of a branch condition to potentially override any prediction. However, in some (quite possibly rare) cases, stalling may be a better option than prediction or dynamic predication. Fine-grained or simultaneous multithreading can reduce the cost of such a stall. If a high-latency load severely constrains the extractable parallelism, stalling on an unpredictable branch could be more efficient than the alternatives and possibly even deliver higher performance (depending on the cost of a misprediction).
(As I think I mentioned already, early resolution can also have benefits similar to good confidence estimation. Knowing that a path is correct [even if excluding exceptions] can facilitate better allocation of resources [e.g., possibly freeing registers early] or accelerate the commitment of a memory transaction [even in a non-TM system one could conceive of forwarding store buffer values to a requester speculatively when the probability of the stores not actually occurring is tiny and revocation is possible].)
> And its candidate usecase is limited to high-misprediction (or
> use prediction), not-in-loop-body (or use CFD) branches.
> And compiler has to shuffle code to fill a potentially large and unpredictable
> bubble, which IMO is unlikely to find that much work to fill.
I agree that the use cases seem to be relatively limited. However, some unpredictable branches are not friendly to predication (especially multi-way and indirect branches) and control flow decoupling seems too novel to be adopted as a reasonably safe design point.
(I think I already mentioned that branches that can be resolved early might also tend to be relatively predictable anyway, reducing the benefit of early resolution.)
I suspect that in general a reorganization of code which would hurt performance (or power efficiency) under perfect prediction would be undesirable, so the question is largely about how much compile time should be allocated. The size of the bubble is somewhat less important if earlier resolution always catches mispredictions earlier, even when the bubble is not completely filled. Assuming the compiler is targeting good performance on multiple likely implementations, assuming a moderately large bubble might be appropriate (processors that usually experience a larger bubble are likely to have better predictors, and the benefit of early condition setting is less likely to be worth the compiler effort and possible performance harm).
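To make the kind of reorganization concrete, here is a minimal C sketch (the function and values are made up for illustration): the condition is computed as early as its inputs allow, and independent work fills the gap before the branch that consumes it.

/* Minimal sketch (made-up example): compute the branch condition as early
 * as its inputs allow and fill the gap with independent work, so that by
 * the time the branch is reached it can be resolved with little or no
 * stall on an in-order machine. */
int process(const int *a, int n, int threshold, int key)
{
    int sum = 0;

    /* Condition set early: it depends only on values available at entry,
     * not on the loop below. */
    int rare_path = (key > threshold);

    /* Independent work fills the "bubble" between setting the condition
     * and the branch that consumes it. */
    for (int i = 0; i < n; i++)
        sum += a[i];

    if (rare_path)   /* condition long since resolved */
        sum = -sum;

    return sum;
}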
David I. August et al.'s "Architectural Support for Compiler-Synthesized Dynamic Branch Prediction Strategies: Rationale and Initial Results" (1997) has a graph (Figure 2, page 2) showing that, averaged over several benchmarks, 30% of branches (ranging from under 5% to over 90%--for word count!) could be resolved 6 cycles early (on a 4-wide machine). Of course, that was also assuming aggressive scheduling and predication (which would tend to add "useless" instructions and increase the flexibility of scheduling), so the potential for a less EPIC ISA might be substantially lower. However, I think that an aggressive compiler could do better than what was typically done for delayed branches (IIRC, filling two delay slots was difficult).
(A similar problem applies to load scheduling and prefetching. Just because the targeted processors have extensive OoO capability and fast caches with good hardware prefetching does not mean that loads should be scheduled immediately before use or that prefetching is necessarily completely useless. Admittedly, there may be more profitable efforts for the compiler writer than trying to do clever scheduling for a small to minuscule benefit. Perhaps in open source development there might be some who would enjoy such scheduling problems more than algorithmic transformation or data layout optimizations.)
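(To sketch what I mean by not writing off software prefetching entirely, here is a rough example using the GCC/Clang __builtin_prefetch intrinsic on an indexed-gather loop, a pattern hardware prefetchers often handle poorly; the prefetch distance is a made-up value that would need per-target tuning.)

#include <stddef.h>

#define PREFETCH_AHEAD 16   /* made-up distance; would need tuning per target */

/* Sum elements selected by an index array (a gather pattern that hardware
 * prefetchers often miss); issue the prefetch well before the use so the
 * load latency overlaps independent iterations. */
long sum_indexed(const long *data, const int *idx, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&data[idx[i + PREFETCH_AHEAD]], 0, 1); /* read, low temporal locality */
        sum += data[idx[i]];
    }
    return sum;
}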
I would also note that, despite what Linus Torvalds seems to think, not all processors are Great Big OoO processors. For the low end (say, 3 to 5 pipeline stages, scalar), multiple condition codes would mostly make sense only if certain condition values are relatively persistent (where such might save some energy; otherwise a single condition code is likely quite adequate to provide early branch resolution). A bit higher in performance (say, up to 9 pipeline stages, two-wide, in-order), multiple condition codes could have a performance benefit. Even higher in performance (9+ pipeline stages, 2+ issue, OoO), multiple condition codes might be a slight disadvantage (considering the extra complexity in renaming and arranging data flow). At the highest performance, multiple condition codes might not make much difference.
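(As a contrived C illustration of a "relatively persistent" condition value: a loop-invariant test inside a loop. With multiple condition codes a compiler could evaluate the compare once and keep its flags parked in a second condition register across the whole loop, while the loop's own exit test uses another; with a single condition code the compare either gets redone each iteration or its result has to be materialized in a general register. A real compiler might simply unswitch this loop, but it shows the persistence I mean.)

/* Contrived example: 'mode > 0' is loop-invariant, so its compare result
 * is persistent.  With multiple condition-code registers the flags from
 * that compare could stay live across the whole loop, separate from the
 * condition code used by the loop's exit test. */
void scale(int *x, int n, int mode)
{
    for (int i = 0; i < n; i++) {      /* exit test uses one condition code */
        if (mode > 0)                  /* persistent, loop-invariant condition */
            x[i] *= 2;
        else
            x[i] /= 2;
    }
}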
Early condition code setting seems to be kind of like delayed branches without the need to architect the delay; it can benefit some lower-end designs, is a bit painful in the middle, and is in the noise at the high end.