By: Maynard Handley (name99.delete@this.name99.org), April 29, 2017 9:01 pm
Room: Moderated Discussions
Exophase (exophase.delete@this.gmail.com) on April 29, 2017 9:22 pm wrote:
> Megol (golem960.delete@this.gmail.com) on April 29, 2017 2:18 pm wrote:
> > If that was true there's no need for branch prediction in in-order processors as the
> > processor would soon stall anyway waiting for register writes delayed by branch resolvement.
> > Now look at high-performance in-order designs, do they use branch prediction?
> >
>
> Those register writes won't happen any sooner than the branch's resolution on an in-order processor.
>
> > > and specifically about out-of-order processors:
> > >
> > > - You don't need a ROB entry for every instruction, only for those which can mispredict (branches)
> > > or raise exceptions (loads, stores, divides if your architecture is badly designed). Yes, you
> > > need the rename map update information to unroll that from other intermediate instructions, but
> > > that's smaller than you'd think due to values having a high rate of infant mortality.
> > > - You don't even need a ROB entry for every potentially faulting instruction. Most loads
> > > and stores don't fault, so you can batch them together into one atomically faulting unit.
> > > If they do indeed fault, you flush and go into a "limp home mode" so you can precisely
> > > fault one. Exceptions are slow anyway so it isn't a major performance issue.
> > > - You don't even need to cover L2 latency with your ROB - just TLB latency. Nobody does precise bus faults
> > > anyway, so you can pop loads/stores off the ROB once they've been translated. Yes, if you do lots of TLB
> > > misses, your ROB will fill up quicker than you expected, but your performance was screwed anyway
> > >
> > >
> >
> > That should be obvious for anyone who tries to keep up to date with computer research. Doesn't change
> > the basic facts, just the multiplier of how costly things are. And they cause other problems.
>
> It doesn't seem obvious to me. Every OoO uarch whose details I'm aware of fills and retires
> the ROB in-order. An instruction can only be retired from the ROB once all older instructions
> ahead of it have retired. How can you track this if you're only putting some instructions in it?
>
There are multiple ways to skin a cat...
What is the ULTIMATE goal of the ROB? This is not completely obvious.
The obvious answer is to support speculation, or more precisely to recover from incorrect speculation. If that is the goal, then what sort of speculation are you trying to recover from? The usual answer is control speculation, and if THAT is the case, then you don't need the checkpoint that the ROB provides for EVERY instruction, you only need a checkpoint at the points of control speculation. (And you don't even need all of those! You can just put checkpoints at the branches with low-confidence predictions, and on a misspeculation wind back to the nearest such checkpoint.)
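That confidence-gated checkpointing idea can be sketched in a few lines. This is a toy model, not any real design: all the names here are invented, and a real machine checkpoints the rename map (and a few other structures) in hardware, not a dictionary.

```python
# Toy sketch of confidence-gated checkpointing: snapshot the rename map
# only at low-confidence branches; on a mispredict, restore the nearest
# snapshot at or before the offending branch and replay forward from there.
# All names are illustrative, not taken from any real microarchitecture.

class CheckpointedRenameMap:
    """Rename map that takes snapshots only at low-confidence branches."""

    def __init__(self, num_arch_regs=4):
        # arch reg -> physical reg; starts as the identity mapping
        self.rename_map = {r: r for r in range(num_arch_regs)}
        # seq_no of a low-confidence branch -> saved copy of the map
        self.checkpoints = {}

    def on_branch(self, seq_no, confidence):
        # Only low-confidence branches pay the cost of a checkpoint.
        if confidence < 0.5:
            self.checkpoints[seq_no] = dict(self.rename_map)

    def rename(self, arch_reg, new_phys_reg):
        self.rename_map[arch_reg] = new_phys_reg

    def on_mispredict(self, seq_no):
        # Find the nearest safe point at or before the bad branch.
        # (A real design needs a fallback, e.g. a full flush, when no
        # checkpoint covers the branch; omitted here for brevity.)
        target = max(s for s in self.checkpoints if s <= seq_no)
        self.rename_map = dict(self.checkpoints.pop(target))
        # Discard younger checkpoints; they are on the squashed path.
        self.checkpoints = {s: m for s, m in self.checkpoints.items()
                            if s < target}
        return target
```

Note the trade-off this makes explicit: a mispredicted high-confidence branch has no checkpoint of its own, so recovery winds back to an older safe point and re-executes the (correct) instructions in between, which is exactly the cost you accept in exchange for far fewer checkpoints.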
So that's one answer. But then there is the question of precise interrupts. Do you care about them, and under what conditions? Do you have other forms of speculation and can you unwind them just by replay?
So there are many aspects to the question depending on precisely what you are doing, but the LARGE answer is that --- you only need a way to "recover from problems" to some safe point in the past, you don't have to be able to recover to EVERY instruction. And so how can you usefully exploit this fact?
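The quoted idea of batching potentially faulting instructions "into one atomically faulting unit" with a "limp home mode" can be sketched the same way. This is a purely illustrative model (the names and block size are invented): state is committed only at block boundaries, and a fault anywhere in a block squashes the block and replays it one instruction at a time so the faulting instruction is identified precisely. It assumes the block's work can be re-executed deterministically on replay.

```python
# Sketch of coarse-grained recovery: instructions commit in blocks, and
# only block boundaries are safe points. A fault inside a block discards
# the block's results and re-runs it one instruction at a time
# ("limp-home mode") until the faulter is isolated precisely.
# Illustrative only; not modeled on any real design.

class FaultError(Exception):
    """Stands in for a page fault or similar precise exception."""

def run(instructions, block_size=4):
    committed = []
    i = 0
    while i < len(instructions):
        block = instructions[i:i + block_size]
        try:
            # Fast path: results become architectural only if the
            # whole block executes without faulting.
            results = [op() for op in block]
            committed.extend(results)
        except FaultError:
            # Slow path: replay from the block's start, committing one
            # instruction at a time, so the fault is taken precisely.
            for op in block:
                try:
                    committed.append(op())
                except FaultError as e:
                    return committed, e   # precise fault point
        i += block_size
    return committed, None
```

The design choice matches the argument in the quote: the slow path is much more expensive per instruction, but faults are rare, so paying for precision only on the replay path is cheap on average.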
One fairly recent summary of the options can be found in this (2015) PhD thesis (which, IMHO, should be of general interest --- most of the issues people keep complaining about here as non-scalable aspects of "OoO" are addressed --- that is after all what you would expect from a work titled "Efficient Scaling of Out-of-Order Processor Resources"...):
https://idea.library.drexel.edu/islandora/object/idea%3A6322/datastream/OBJ/download/Efficient_Scaling_of_Out-of-Order_Processor_Resources.pdf