By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), February 17, 2013 5:28 pm
Room: Moderated Discussions
Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on February 17, 2013 12:30 pm wrote:
[snip]
> No, it can't ever be faster. You need 4 instructions whether you expand the sequence using logical operations
> and a branch, 2 compares and branches or 2 compares with a condition combine instruction and a branch.
>
> However conditional execution needs 2 or 3 instructions for this, and combined
> compare+branch needs just 2 so those are the fastest ways on in-order CPUs.
Itanium-style compare and OR or AND with a CC could bring the CC version to three instructions--and would not need a second CC register. (For boolean conditions, such can allow removal of the comparison instruction when the earlier condition prevents the ORed or ANDed comparison from influencing the result--not that such a hardware optimization would necessarily be worthwhile!)
I find multiple condition registers attractive partially because such can be less expensive front-end registers (future file for an OoO implementation) allowing early branch resolution. In theory, branches whose conditions are set early could be resolved just after branch prediction (the predictor could indicate that the branch is a candidate for early resolution and what condition register+condition is tested), but even just after decode branch resolution could have some benefits.
Admittedly, the scope for potential improvement (conditions that can be set early) is very limited and lack of compiler support is probably a significant issue (in some cases, early setting would require interprocedural optimization). The scope is perhaps slightly larger than one might expect, however, since after a branch misprediction the current at time of mispredicted branch values are likely to be in the front end by decode time for the redirected instruction stream.
In addition, the benefit of early resolution over correct branch prediction may be small and many branches that can be resolved early may be highly predictable. (Accurate estimates of prediction confidence would further reduce the benefit of early resolution. A high confidence branch could use low-performance misprediction recovery [e.g., releasing ROB entries and requiring re-execution]; a low confidence branch might use predication or reduced priority [in a single-thread mode such might reduce power consumption at the cost of performance].)
This kind of optimization is what motivated Sheikh et al.'s "Control-Flow Decoupling" (MICRO 2012).
(Along similar lines, providing a few bits of information in the front end about a counter would allow early branch resolution [this would be a bit like the stack pointer optimization of recent x86], potentially resolving the branch before instruction fetch using a this-is-a-counted-inner-loop predictor with taken/not-taken/unknown state derived from the small front-end subset of the counter. This would require a way for the processor to determine that a value is a counter, but this need not actually be an architectural designation [though architecturally specifying a register--it need not be a special-purpose register as in Power and Itanium--would allow decrement-test-and-branch instructions].)
Multithreading may make early branch resolution more common (by reducing the effective pipeline delay between when a condition is generated and when it would be used in the front end) as well as less inferior to branch prediction in terms of delay (if prediction was four stages before resolution, it might appear as only two cycles with two-way multithreading).
I am not confident that multiple condition registers are the best design, but the trade-offs do not seem entirely clear.
[snip]
> No, it can't ever be faster. You need 4 instructions whether you expand the sequence using logical operations
> and a branch, 2 compares and branches or 2 compares with a condition combine instruction and a branch.
>
> However conditional execution needs 2 or 3 instructions for this, and combined
> compare+branch needs just 2 so those are the fastest ways on in-order CPUs.
Itanium-style compare and OR or AND with a CC could bring the CC version to three instructions--and would not need a second CC register. (For boolean conditions, such can allow removal of the comparison instruction when the earlier condition prevents the ORed or ANDed comparison from influencing the result--not that such a hardware optimization would necessarily be worthwhile!)
I find multiple condition registers attractive partially because such can be less expensive front-end registers (future file for an OoO implementation) allowing early branch resolution. In theory, branches whose conditions are set early could be resolved just after branch prediction (the predictor could indicate that the branch is a candidate for early resolution and what condition register+condition is tested), but even just after decode branch resolution could have some benefits.
Admittedly, the scope for potential improvement (conditions that can be set early) is very limited and lack of compiler support is probably a significant issue (in some cases, early setting would require interprocedural optimization). The scope is perhaps slightly larger than one might expect, however, since after a branch misprediction the current at time of mispredicted branch values are likely to be in the front end by decode time for the redirected instruction stream.
In addition, the benefit of early resolution over correct branch prediction may be small and many branches that can be resolved early may be highly predictable. (Accurate estimates of prediction confidence would further reduce the benefit of early resolution. A high confidence branch could use low-performance misprediction recovery [e.g., releasing ROB entries and requiring re-execution]; a low confidence branch might use predication or reduced priority [in a single-thread mode such might reduce power consumption at the cost of performance].)
This kind of optimization is what motivated Sheikh et al.'s "Control-Flow Decoupling" (MICRO 2012).
(Along similar lines, providing a few bits of information in the front end about a counter would allow early branch resolution [this would be a bit like the stack pointer optimization of recent x86], potentially resolving the branch before instruction fetch using a this-is-a-counted-inner-loop predictor with taken/not-taken/unknown state derived from the small front-end subset of the counter. This would require a way for the processor to determine that a value is a counter, but this need not actually be an architectural designation [though architecturally specifying a register--it need not be a special-purpose register as in Power and Itanium--would allow decrement-test-and-branch instructions].)
Multithreading may make early branch resolution more common (by reducing the effective pipeline delay between when a condition is generated and when it would be used in the front end) as well as less inferior to branch prediction in terms of delay (if prediction was four stages before resolution, it might appear as only two cycles with two-way multithreading).
I am not confident that multiple condition registers are the best design, but the trade-offs do not seem entirely clear.