By: Michael S (already5chosen.delete@this.yahoo.com), February 4, 2013 7:48 am
Room: Moderated Discussions
Jouni Osmala (josmala.delete@this.cc.hut.fi) on February 4, 2013 1:07 am wrote:
> > Patrick's point, which I agree with, is that the x86 penalty really depends a lot on
> > context. I think that for a scalar core, it's probably more than 15%. But for something
> > like the P3, it's a lot less. That's doubly true once you start talking about caches
> > in the range of 1MB/core. At that point, the x86 penalty is really quite small.
> >
> > And it's a fair point, but one I didn't want to dive into
> > because of the inherent complexity. But it's true,
> > the x86 overhead depends on the performance of the core; the higher performance the lower the overhead.
>
> Unfortunately, we don't have Alphas around anymore to show that in practice.
> But the key issue is this: once we are in a high-enough-performance situation, we start wondering
> what we could do to improve it even further. The branch misprediction penalty is one of the key things
> limiting that. And x86 increases the pipeline length.
At the low-performance end of the curve, the x86 tax could easily be in the range of 30 to 100%.
On the other hand, a 2-3 clock increase in branch mispredict penalty (don't forget, that's 2-3 clocks out of 12-20 clocks) accounts for maybe a 1% decrease in performance, and likely even less. Not quite a u-turn.
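For a back-of-envelope check, the sketch below works out that figure; every input is an assumed workload parameter picked for the illustration, not a measurement.

/* Back-of-envelope cost of 2-3 extra mispredict clocks.
   All inputs are assumed workload parameters, not measurements. */
#include <stdio.h>

int main(void)
{
    double mpki      = 3.0;  /* assumed mispredicts per 1000 instructions        */
    double base_cpi  = 0.7;  /* assumed cycles per instruction on the short pipe */
    double extra_pen = 2.5;  /* the 2-3 extra clocks per mispredict              */

    double extra_cpi = mpki / 1000.0 * extra_pen;             /* 0.0075 here     */
    printf("%.2f%% slower\n", 100.0 * extra_cpi / base_cpi);  /* about 1.1% here */
    return 0;
}

With a better-behaved workload (1-2 mispredicts per 1000 instructions) the same arithmetic lands well under 1%.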
> Then there is the increased register pressure's effect on how many memory
> subsystem operations you need at a given performance level. Of course, Intel has spent hardware there to fix
> the performance problems, hardware that consumes power and that, if applied to a RISCier system, could fill
> the buffers with operations inherent in the problem instead of with superfluous memory operations.
There certainly exist problems whose run time is dominated by L1D hit throughput, but they are not very common, to say the least. And for those, I'd think, the relative percentage of memory accesses caused by register starvation is lower than average.
I.e., again, we are talking about a single-digit impact on performance, even in the worst case.
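To put a number on "single-digit even in the worst case", here is the same arithmetic spelled out; both fractions are assumptions chosen for the sketch.

/* Upper bound on what register-starvation spill traffic can cost.
   Both fractions are assumptions, not measurements. */
#include <stdio.h>

int main(void)
{
    double l1d_bound = 1.0;   /* worst case: run time entirely limited by L1D hit throughput */
    double spills    = 0.06;  /* assumed share of L1D accesses that exist only because of
                                 register starvation, i.e. spills/reloads a RISCier
                                 machine would not need */

    printf("upper bound: %.1f%% faster\n", 100.0 * l1d_bound * spills);  /* 6.0% here */
    return 0;
}

In more typical code, where L1D throughput limits only a fraction of the run time, the same calculation drops to 1-2% or less.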
> Then there is just the penalty of having far wider micro-ops stored in the OoO structures compared to RISCier designs.
I am not sure that it is necessary, or even that it is how things are done in the new generation of PRF-based x86 cores. There are indications (L1D load latency) that SandyB stores only 11 or 12 immediate bits in its reservation stations, which is less than your typical RISC. The rest of the fields should be similar to RISC.
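A very rough bit budget for a scheduler entry illustrates the point; every field width below is an assumption picked for the sketch, not a documented value.

/* Very rough scheduler-entry bit budget.  Every field width is an
   assumption chosen to illustrate the point, not a documented value. */
#include <stdio.h>

static int entry_bits(int imm_bits)
{
    int opcode    = 8;      /* assumed internal uop/opcode field             */
    int preg_tags = 3 * 8;  /* dest + 2 sources, assuming a ~160-entry PRF   */
    int misc      = 8;      /* assumed: flags, port/latency hints, etc.      */
    return opcode + preg_tags + imm_bits + misc;
}

int main(void)
{
    printf("x86-style uop, 12-bit immediate : %d bits\n", entry_bits(12));  /* 52 */
    printf("RISC-style op, 16-bit immediate : %d bits\n", entry_bits(16));  /* 56 */
    return 0;
}

Under these assumptions the truncated-immediate x86 entry is, if anything, a few bits narrower than the RISC one.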
> And renaming of condition codes still requires hardware that needs to be active.
It performs useful work by taking part of the pressure off the GPR-related parts of the design.
Anyway, pure GPR-only RISCs no longer exist in the high-performance range. All 5 existing high-performance general-purpose ISAs (IPF/Power/SPARC/x86/zArch), as well as the single newcomer (Arm), hold comparison results either as condition codes or in a special-purpose register file.
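For concreteness, here is where the comparison result lives in the two styles; the instruction sequences in the comments are typical compiler output written from memory, so treat them as illustrative rather than authoritative.

#include <stdio.h>

/* Where does the result of a compare live?  x86 keeps it in the (renamed)
   FLAGS register; a classic GPR-only RISC such as Alpha writes it into a
   general-purpose register instead. */
static int below(unsigned a, unsigned b)
{
    /* Typical code, from memory:
       x86-64: cmp  edi, esi         ; result lands in FLAGS, no GPR consumed
               setb al               ; materialized only when a value is needed
       Alpha:  cmpult $16, $17, $0   ; result occupies a GPR and a rename slot */
    return a < b;
}

int main(void)
{
    printf("%d %d\n", below(1, 2), below(2, 1));  /* prints: 1 0 */
    return 0;
}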
> The problem isn't how much die area they spend anymore, it's where the extra die area adds
> latency or consumes power, and that's where the x86 penalty really comes in these days.
Except that, in practice, the difference is lost in the noise of other factors.