Exophase on August 16, 2016 11:02 am wrote:
> Ricardo B (ricardo.b.delete@this.xxxxx.xx) on August 16, 2016 10:33 am wrote:
> > The other problem is the following. If I understood you correctly, you were proposing something like
> > mov reg, hint # Store hint into reg
> > ... lots of instructions later ...
> > jmp reg, label # Jump with hint
> >
> > The problem is that in a big OoO core like Skylake, the front end has no idea
> > about the state of the execution of the "mov reg, hint" instruction.
> > For all we know, "reg" may not have been assigned a physical register yet, or
> > its physical register hasn't been written yet (and thus holds some garbage value).
> >
> > And the sort of logic that is needed to sort this state out is big and complex: that's
> > why CPUs have so many pipeline stages, to split this sort of complex logic.
> >
> > Deep pipelining arises from the need
> > to split complex logic over multiple clock stages.
> > Branch prediction is a need which arises from pipelines, but branch prediction logic
> > needs to be kept simple so it can work in a single clock stage (or a few).
> > If a branch prediction algorithm cannot be implemented in fast logic, it's worthless. We'd
> > be better off stalling the pipeline until the branch instruction is actually resolved.
> >
> > An alternative is something like this:
>
> jnz_hint target
> ...
> target:
> jnz label
>
> Here the jnz_hint instruction updates the branch prediction state. If the condition is
> true it's updated to predict that the branch will be taken, otherwise not taken. Something
> similar could be done with an instruction that pre-loads the target for an indirect branch.
> There are still some pretty obvious problems with this. Because the hint is executed in the back-end it's very
> difficult to predict how many instructions ahead of its target it needs to be. If it comes too late it will
> end up predicting the branch for the wrong iteration which could make the result worse than nothing. And you'll
> need to keep a lot of dependencies live in order to perform the branch computation twice. On archs like x86
> it's going to be hard to keep flags state intact so you're liable to have to repeat comparisons.
> I also don't know how realistic it is to have this part of the back-end update the branch prediction state,
> although other parts will anyway as part of the natural branch prediction and correction process.
> In practice I doubt anything like this would actually be used to supplement high
> end branch prediction hardware. Only on uarchs like Cell's SPE that lacked hardware
> branch prediction altogether would you find real use for something like this.

As a purely I'm-bored-waiting-for-software-to-run exercise:
The ISA would have a bank of 1-bit prediction registers bpreg* and instructions to manipulate them.
Usage would be:
write bpregX, hint
je bpregX, target

In order to keep things simple, execution of these instructions would be along these lines:
If there are writes to bpregX in flight, branches which use bpregX as a hint will either stall fetch until these writes complete or ignore the hint.

This would mean that only a bus carrying the real, committed bpreg values would need to be fed to the fetch unit.

This might be feasible.
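
To make the fetch-side rule above a bit more concrete, here is a toy Python model of it. Everything in it is made up for illustration: the class and method names, the bank size, and the per-register counter of in-flight writes are all assumptions, and how a real front end would track pending writes is left open. It only shows the decision the fetch unit would make from the committed bpreg values.

```python
from enum import Enum


class FetchAction(Enum):
    PREDICT_TAKEN = "predict taken"
    PREDICT_NOT_TAKEN = "predict not taken"
    STALL = "stall fetch until pending writes complete"
    IGNORE_HINT = "ignore hint, fall back to the normal predictor"


class BpregBank:
    """Hypothetical bank of 1-bit prediction registers (bpreg0..bpregN-1)."""

    def __init__(self, size=8):
        # Committed (architectural) values: the only state fetch reads.
        self.committed = [False] * size
        # Number of 'write bpregX, hint' instructions issued but not committed.
        self.in_flight = [0] * size

    def issue_write(self, idx):
        # A 'write bpregX, hint' has entered the pipeline.
        self.in_flight[idx] += 1

    def commit_write(self, idx, hint):
        # The write retires (in program order) and becomes visible to fetch.
        if self.in_flight[idx] == 0:
            raise ValueError("no write in flight for this bpreg")
        self.in_flight[idx] -= 1
        self.committed[idx] = bool(hint)

    def fetch_branch(self, idx, stall_on_pending=True):
        # 'je bpregX, target' is being fetched: decide what the front end does.
        if self.in_flight[idx] > 0:
            # The committed value may be stale, so either wait or drop the hint.
            if stall_on_pending:
                return FetchAction.STALL
            return FetchAction.IGNORE_HINT
        if self.committed[idx]:
            return FetchAction.PREDICT_TAKEN
        return FetchAction.PREDICT_NOT_TAKEN


if __name__ == "__main__":
    bank = BpregBank(size=4)
    bank.issue_write(1)                  # write bpreg1, hint (still in flight)
    print(bank.fetch_branch(1))          # pending write: stall
    bank.commit_write(1, True)           # the write commits with hint = taken
    print(bank.fetch_branch(1))          # committed value is now usable
```

The stall_on_pending flag just selects between the two options mentioned above (stall fetch or ignore the hint); which one is better would depend on how far ahead of the branch the write can usually be scheduled.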

