By: anon (anon.delete@this.anon.com), November 12, 2014 5:48 am
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on November 12, 2014 5:08 am wrote:
> Charles (spam.me.delete@this.not.com.au) on November 12, 2014 2:29 am wrote:
> [snip]
> > Hard to remember the details; some key points were the rather long pipeline (double P6), the rather
> > small L1D cache (half P6), and the single instruction decoder potentially being a bottleneck...
>
> The small Dcache was not inappropriate for a speed demon design and "merely" requires a fast L2.
Even when a miss requires a scheduler replay?
I suppose a larger cache would have fewer misses, but each miss might take longer to be detected and squashed. If the miss rate is roughly inversely proportional to sqrt(size) and latency is roughly proportional to sqrt(size), the two effects about cancel, so you may be right about that. I'm not too familiar with how the replay system works, though.
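To make the square-root argument concrete, here is a rough sketch. Everything in it is an assumption for illustration: the constants `k` and `c` are made up, and the idea that the work squashed per miss scales with hit latency is my guess, not a description of the actual Pentium 4 replay mechanism.

```python
import math

def miss_rate(size_kib, k=0.16):
    """Assumed rule of thumb: miss rate falls as 1/sqrt(cache size)."""
    return k / math.sqrt(size_kib)

def hit_latency(size_kib, c=0.7):
    """Assumed: load-to-use latency (cycles) grows as sqrt(cache size)."""
    return c * math.sqrt(size_kib)

def replay_cost_per_load(size_kib):
    """Expected replay cost per load, in arbitrary units.

    Assumption: the work squashed on a miss is proportional to the hit
    latency, since dependents are scheduled before the miss is known.
    """
    return miss_rate(size_kib) * hit_latency(size_kib)

if __name__ == "__main__":
    for size in (8, 16, 32, 64):
        print(f"{size:>3} KiB: miss rate {miss_rate(size):.4f}, "
              f"latency {hit_latency(size):.2f}, "
              f"replay cost {replay_cost_per_load(size):.4f}")
```

Under those two assumptions the product is k*c regardless of size, so the replay cost per load comes out flat: the smaller cache misses more often, but each miss is cheaper to squash.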
> The single instruction decoder was (in my opinion) not especially tied to the speed demon nature
> (two-wide decode would not have been especially slow and that path applies to cache misses).
I suppose the single-wide decoder was tied to the trace cache. If they had used a two-wide decoder, would they even have needed the trace cache? The first Pentium 4 had very close to 2/3 the IPC of the Pentium III of the same era on SPECint2000. And actually the decoder was not widened to 4 until much later, although it was made more capable in other ways.
But this just raises the question: why did they do the trace cache at all? I think a single-wide x86 decoder on a high(er) latency path must have looked mighty attractive for them to take such a big risk.