By: anon (anon.delete@this.anon.com), November 12, 2014 8:37 pm
Room: Moderated Discussions
Ricardo B (ricardo.b.delete@this.xxxxx.xx) on November 12, 2014 12:51 pm wrote:
> anon (anon.delete@this.anon.com) on November 12, 2014 5:48 am wrote:
>
> > I suppose single wide decoder was tied to the trace cache. If they used a two wide decoder,
> > would they have even needed the trace cache? The first Pentium 4 was very close to 2/3 the
> > IPC of the Pentium III of the same era on SPECint2000. And actually the decoder was not
> > widened to 4 until much later, although it was made more capable in other ways.
> >
> > But this just raises the question, why did they do the trace cache at all? I think a single-wide x86
> > decoder on a high(er) latency path must have looked mighty attractive to take such a big risk.
>
> Fast wide x86 decoding is not cheap.
Well, the assertion I replied to was that a two-wide decoder would not have been very costly.
>
> Intel's previous processors, P-Pro/P-II/P-III, had a moderately wide and somewhat fragile
> decoding logic: 3 decoders, of which two could only decode simple instructions.
Actually, I think all 3 could decode simple instructions. Only 1 could decode complex ones, and microcode went through a separate path.
But the IPC of the Pentium 4 was 2/3 that of the Pentium III, so a 2-wide decoder should have been sufficient.
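The "decode limited if not properly scheduled" point can be made concrete with a toy model (hypothetical, not cycle-accurate) of a 3-wide decoder in which only slot 0 accepts a complex instruction, as with P6's restriction:

```python
def decode_cycles(stream, width=3):
    """Cycles to decode a stream of 's' (simple) / 'c' (complex)
    instructions when only decode slot 0 accepts complex ones
    (a simplified P6-style restriction)."""
    cycles = 0
    i = 0
    while i < len(stream):
        cycles += 1
        slot = 0
        while slot < width and i < len(stream):
            if stream[i] == 'c' and slot != 0:
                break          # complex op must wait for slot 0 next cycle
            i += 1
            slot += 1
    return cycles

# Same instruction mix, different static ordering:
print(decode_cycles('css' * 4))  # complex leads each group -> 4 cycles
print(decode_cycles('ssc' * 4))  # complex lands mid-group  -> 5 cycles
```

Same instructions, ~25% fewer decode cycles just from ordering, which is the kind of fragility compilers had to schedule around.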
> And a lot of critical loops would become decode limited if not properly scheduled.
>
> Netburst's trace cache was meant as a way to bypass these issues.
> In theory, it would provide robust high bandwidth instruction
> fetch without the need for wide generic x86 decoding logic.
> In practice, it didn't work out so well.
>
> Only after Netburst Intel began improving x86 decoding, with the introduction of µOP fusion in the Banias.
I'm not sure what you mean. x86 decoding was improved in every generation of Intel microarchitecture before the P4.
>
> And still, with Sandy Bridge Intel successfully resurrected
> the concept of µOP caching to bypass the x86 decoders.
Even earlier than that, with the loop buffer too.
> This time it was paired with robust and wide x86 decoders, but it still
> provides gains in power, bandwidth and branch misprediction penalty.
That's probably reasonable today, with a much less constrained transistor budget and advances in fine-grained gating of unused subsystems.
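A back-of-envelope sketch of the misprediction-penalty gain: on a redirect that hits the µop cache, the restarted front end is shallower than the legacy fetch/decode path. All stage counts and hit rates below are assumed for illustration, not real Sandy Bridge figures.

```python
def avg_refill_depth(hit_rate, uop_cache_stages=2, legacy_stages=6):
    """Expected front-end refill depth after a branch redirect, given the
    fraction of redirects served from the µop cache (all numbers assumed)."""
    return hit_rate * uop_cache_stages + (1 - hit_rate) * legacy_stages

print(avg_refill_depth(0.0))  # legacy decoders only     -> 6.0
print(avg_refill_depth(0.5))  # half the redirects hit   -> 4.0
```

The bandwidth and power gains work the same way: every cycle served from the µop cache is a cycle the x86 decoders can stay gated off.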