By: Megol (golem960.delete@this.gmail.com), February 5, 2013 10:16 am
Room: Moderated Discussions
Patrick Chase (patrickjchase.delete@this.gmail.com) on February 4, 2013 10:05 am wrote:
> Jouni Osmala (josmala.delete@this.cc.hut.fi) on February 4, 2013 1:07 am wrote:
> > > Patrick's point, which I agree with, is that the x86 penalty really depends a lot on
> > > context. I think that for a scalar core, it's probably more than 15%. But for something
> > > like the P3, it's a lot less. That's doubly true once you start talking about caches
> > > in the range of 1MB/core. At that point, the x86 penalty is really quite small.
> > >
> > > And it's a fair point, but one I didn't want to dive into
> > > because of the inherent complexity. But it's true,
> > > the x86 overhead depends on the performance of the core; the higher the performance, the lower the overhead.
> >
> > Unfortunately we don't have Alphas around anymore to show that in practice.
> > But the key issue is this: once we are in a high enough performance situation, we start wondering
> > what we could do to improve it even further. The branch misprediction penalty is one of the key things limiting
> > that, and x86 increases the pipeline length. Then there is the increased
> > register pressure's effect on how many memory
> > subsystem operations you need at a given performance level. Of course Intel has spent hardware there to fix
> > performance problems, hardware that consumes power and that, if applied to a RISCier system, could spend
> > that capacity on operations inherent in the problem instead of filling buffers with superfluous memory operations.
> > Then there is the penalty of having far wider micro-ops
> > stored in OoO structures compared to RISCier designs.
> > And renaming of condition codes still requires hardware that needs to be active.
> > The problem isn't how much die area they spend anymore, it's where the extra die area adds
> > latency or consumes power, and that's where the x86 penalty really comes in these days.
>
> You can mitigate such decode power penalties by using a first-level Icache that contains
> uops instead of instructions. It only works if you're using physical register files (as opposed
> to reservation stations), though, because otherwise the uop size is unmanageable.
>
> Interestingly enough, every Intel OoO CPU with PRFs has had some form of first-level uop cache (P4,
> Sandy Bridge, Ivy Bridge, Haswell). Do you think that perhaps they've figured this one out? :-)
The µop cache is on the same level as the decoders and is followed by a decode buffer and then renaming. I can't see how a stateful ROB would be a problem in such a design, as the µops have maybe 5-bit register fields (to hold architectural registers and µcode temporary registers) regardless of whether the reorder mechanism uses a stateful ROB, a PRF, or distributed reservation stations.
To make any difference, the µop cache would have to store pre-renamed registers/fields, but while that has been proposed in some papers, no implementation of any architecture I know of has used anything like it.
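To make the point concrete, here is a minimal sketch (in Python, purely illustrative, with assumed structure sizes) of what the rename stage downstream of the µop cache does: the narrow architectural register fields in each µop, 5 bits covering 32 names, are translated through a rename map into wider physical register indices taken from a free list. The µop stored in the cache only ever carries the narrow pre-rename fields, which is why the cached µop size is the same no matter what the backend uses.

```python
# Illustrative sketch of register renaming. Sizes are assumptions for
# the example, not any specific CPU: 32 architectural/µcode-temporary
# names (5-bit µop fields) mapping onto a 160-entry physical file.

ARCH_REGS = 32      # 5-bit register fields in the cached µop encoding
PHYS_REGS = 160     # physical register file (assumed size)

rename_map = list(range(ARCH_REGS))            # arch reg -> phys reg
free_list = list(range(ARCH_REGS, PHYS_REGS))  # unallocated phys regs

def rename(dst, srcs):
    """Rename one µop: read source fields through the map, then
    allocate a fresh physical register for the destination."""
    phys_srcs = [rename_map[s] for s in srcs]  # sources see older mappings
    phys_dst = free_list.pop(0)                # new mapping for the dest
    rename_map[dst] = phys_dst
    return phys_dst, phys_srcs

# r3 = r1 + r2, then r3 = r3 + r4: the second write to r3 gets a fresh
# physical register, removing the write-after-write dependence while the
# read of r3 still sees the first µop's result.
d1, s1 = rename(3, [1, 2])
d2, s2 = rename(3, [3, 4])
assert s2[0] == d1   # second µop reads the first one's result
assert d2 != d1      # but writes a different physical register
```

Note that the 5-bit fields are all the cache has to hold in either case; only after this translation do the wide physical indices exist, which is why storing pre-renamed fields in the µop cache would be the thing that actually changed the picture.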