By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), February 23, 2013 8:09 am
Room: Moderated Discussions
hobold (hobold.delete@this.vectorizer.org) on February 20, 2013 2:02 pm wrote:
> Paul A. Clayton (paaronclayton.delete@this.gmail.com) on February 19, 2013 5:52 pm wrote:
> > EBFE (x.delete@this.y.com) on February 19, 2013 1:03 am wrote:
> > [snip]
> > > By "early branch resolution" you mean that the branch simply doesn't predict and stall for the condition?
> >
> > No, I was thinking of using early availability of a branch
> > condition to potentially override any prediction.
>
> This has been done in a family line of short pipelined (4 or 5 stages in total, IIRC) PowerPC processors.
> The model numbers were PPC603, PPC603e, PPC740, PPC750, PPC7400, PPC7410. I think some of those still
> live on as embedded core line, under names like "e300". All these processors could resolve conditional
> branches without prediction when the respective condition register field had no updates pending as the
> branch was encountered. If there were updates in flight, the branch was predicted as usual.
I think it was a description of the PowerPC 750 (Apple G3) that encouraged me to think along these lines. Unfortunately, that implementation seems to support Linus Torvalds' view that such tricks tend to divert attention from more generally useful implementation techniques. If I recall correctly, the PPC750 delayed prediction and fetching from a predicted path until the branch was decoded. Even with the Branch Target Instruction Cache, this could introduce a fetch bubble. In addition, the PPC750 was only weakly out-of-order (OoO), and that implementation choice may have been influenced by the desire to resolve branches early (without renaming condition registers).
Note: the PPC750 was intended to be simple/small/low-cost (and low-power), so the design choices were not solely influenced by a desire for early branch resolution.
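To make the mechanism concrete (resolve the branch without prediction when the condition register field has no updates pending, otherwise predict as usual), here is a toy Python model. It is only a sketch: the class and method names are made up for illustration, it assumes compares complete in program order, and it is not a description of how the PPC750 actually tracks pending updates.

```python
from dataclasses import dataclass, field


@dataclass
class FrontEnd:
    """Toy front end with 8 condition register fields (hypothetical model)."""
    values: list = field(default_factory=lambda: [False] * 8)
    pending: list = field(default_factory=lambda: [0] * 8)

    def issue_compare(self, cr):
        # A compare writing field `cr` has been fetched but not yet executed.
        self.pending[cr] += 1

    def complete_compare(self, cr, result):
        # The compare executed; assumes completion in program order,
        # so the last completed result is the current value.
        self.pending[cr] -= 1
        self.values[cr] = result

    def branch(self, cr, predicted_taken):
        # Returns (direction, resolved_early).
        if self.pending[cr] == 0:
            # No update in flight: the stored value is current,
            # so the branch is resolved and cannot mispredict.
            return self.values[cr], True
        # An update is in flight: fall back to the predictor's guess.
        return predicted_taken, False
```

With a compare still in flight, `branch(2, True)` just returns the prediction; once `complete_compare(2, False)` has run, the same call returns the actual condition and marks the branch as resolved.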
> There were contemporary processor models by the same manufacturers with longer pipelines (and
> higher clock frequencies and generally higher performance) that did not even try to do this trick.
> Apparently it is not too effective as the machines get wider and deeper. It worked nicely for
> the simpler microarchitectures, though, and helped them to conserve energy as well.
One problem that PowerPC seems to have had was that compilers would not take advantage of the multiple condition registers (nor the multiple bits per condition register)--I think Maynard Handley mentioned this on comp.arch some time ago. Even early loading of the count register might not have been aggressively sought. (In nested count-based loops, it would be easy to set the last iteration condition of the outer loop early and use the count register for the inner loop.) If compilers do not exploit a feature, there is little reason to develop microarchitectures that would benefit from such compiler work.
(This seems somewhat related to the tendency of x86 implementations not to rename segment base registers. If these are rarely modified, as in current code, then the effort to support renaming is mostly wasted.)
(Even with a deep pipeline, setting multiple condition registers early can have a benefit: on a branch misprediction the front-end can be updated with current values, allowing later branches to be resolved early. [Although in some cases it would be theoretically possible for the front-end to learn about the conditions from branches in the incorrect path, it would be difficult to correlate branch conditions determined in the false path with branches in the correct path and to guess whether they would be set the same way in the correct path. This would also only provide a--hopefully very accurate--prediction, not early resolution.])
Even with an effective compiler, I suspect that high-performance microarchitectures would not benefit much. (On the other hand, even a small improvement at the high end can be significant, though I suspect the complexity budget could be better spent elsewhere.)
It seems that it would be desirable for the compiler/programmer to be able to exploit their knowledge. Branches which are highly predictable or whose conditions cannot easily be set early might use a compare against a register or a frequently set condition register (which is set just before the branch). Branches whose conditions can be set early could use condition/predicate registers that are located/copied in the front end. (Likewise, distinguishing predication conditions could be useful.)
Unfortunately, compiler optimizations for one microarchitecture might be (moderate) pessimizations for another microarchitecture.
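The nested-loop case can be sketched in Python, purely as an illustration of the scheduling idea. The names `outer_done` and `ctr` are hypothetical stand-ins for a spare condition register field and the count register, and the comments naming `mtctr`/`bdnz` only indicate the intended analogy, not what any compiler actually emits.

```python
def nested_sum(rows):
    """Sum a list of lists, setting the outer exit condition early."""
    total = 0
    n = len(rows)
    i = 0
    while n > 0:
        # Outer-loop "last iteration" condition computed at the top of the
        # body (stand-in for a compare into a spare condition register field).
        outer_done = (i + 1 == n)
        # Stand-in for mtctr: the count register serves the inner loop.
        ctr = len(rows[i])
        j = 0
        while ctr > 0:
            total += rows[i][j]
            j += 1
            ctr -= 1  # stand-in for the decrement in bdnz
        i += 1
        # The condition has been available since before the inner loop ran,
        # so a front end holding it could resolve this branch early.
        if outer_done:
            break
    return total
```

The point is only that the outer branch's condition does not depend on anything computed in the inner loop, so it can be set as early as the loop structure allows.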