By: Per Hesselgren (grabb1948.delete@this.passagen.se), February 3, 2013 1:44 am
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on February 2, 2013 5:49 pm wrote:
> Patrick Chase (patrickjchase.delete@this.gmail.com) on February 2, 2013 9:42 am wrote:
> > anon (anon.delete@this.anon.com) on February 2, 2013 5:04 am wrote:
> > > Patrick Chase (patrickjchase.delete@this.gmail.com) on February 1, 2013 10:11 pm wrote:
> > > > David suggested posting this to the forum. I think he has a few remarks of his own to add on this topic...
> > > >
> > > > I think that the statement that x86 takes 5-15% more area than RISC is a bit simplistic,
> > > > because the penalty is highly variable depending on what performance level you're
> > > > targeting and what sort of microarchitecture you have to use to get there.
> > > >
> > > > As a simple example, x86 is utterly noncompetitive at the area/performance/power levels of, say, a Cortex
> > > > M1/M3/M4 or even an R4. We saw the same thing in the mid 80s (compare MIPS R3K to 80386 at similar area,
> > > > or R3K to 80486 at similar performance) and we see it now in the low end of the embedded market. x86 has
> > > > traditionally been forced to rely on microcode and to forego micropipelining (as opposed to functional unit
> > > > level pipelining a la 80386) to hit such low area targets, and that kills performance. RISCs win the area
> > > > comparison by significant integer factors in that regime - It took Intel's million-transistor 80486 to catch
> > > > up to an R3000 system (including caches and FPU) that totalled a few hundred thousand transistors.
> > > >
> > > > x86 starts to be marginally competitive once you get to dual-issue
> > > > in-order superscalars (P5 vs. contemporaries
> > > > in the late 80s; Atom vs. A8/A9 [*] today etc). The "x86 penalty" becomes fairly trivial once you get
> > > > up to full-blown out-of-order Tomasulo machines and the like. We saw that with P6 vs. contemporaries,
> > >
> > > R10000 was introduced within a couple of months of PentiumPro.
> > >
> > > R10K was 4 way superscalar, 64-bit, OOOE, and excluding the larger L1
> > > caches in the MIPS, the core was fewer transistors by the looks.
> > >
> > > On the sameish process (0.35) and date, 195MHz R10K was ~10% faster in
> > > specint95 than the 200MHz PentiumPro, and ~50% faster in specfp95.
> >
> > Cache size has a huge impact on performance for many workloads. Architects make tradeoffs
> > between core complexity and cache size all the time in order to optimize overall performance,
> > and you therefore can't ignore caches when making comparisons. Area is area.
>
> The core minus caches were fewer transistors for the R10K though. And it was 64 bit, 4 issue.
>
> >
> > The R10K die was 298 mm^2, P6 was 196 mm^2. R10K is 50%
> > bigger, 10% faster for integer, and 50% faster for FP.
> > If I take your claim of process equivalence at face value then
> > that indicates that Intel had the performance-per-unit-area
> > edge at that point
> > (though I think you're wrong - Intel's design and process were better. If Intel had designed
> > and fabbed the R10K it would have been significantly better than what MIPS came up with).
> >
> > Want to try again?
>
> I'd rather that you provide evidence for your claim that x86 penalty or PPro vs contemporaries is trivial.
>
> >
> > Keep in mind also that most R10K installations used much larger external L2s than P6 (256K for initial
> > P6 vs 256K-16MB for R10K) and that also counts against *total* area and *total* performance. Those
> > big L2s helped a lot with specfp95, as that suite had notoriously small working sets...
>
> Comparison used had 1MB of L2 for R10K. I believe the P5 L2 cache is faster
> and the memory latency was lower. Although you are quite right that it's
> not an apples to apples comparison, which seems difficult or impossible.
The funny thing with SPEC95 (both int and fp) is that the three versions of the Pentium Pro (L2 = 256 KB, 512 KB, or 1024 KB) score very close to one another. That would be very unlikely in later SPEC suites. Could this be typical for a microserver, i.e. a workload that behaves like SPEC95 rather than SPEC 2000 or 2006?
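
For reference, the performance-per-area argument quoted above can be redone as plain arithmetic. This is only a rough sketch: it takes the die sizes and speedups exactly as stated in the thread (not independently verified), counts only the CPU die, and ignores the external L2, so it should not be read as a full area comparison. The helper name `perf_per_area_ratio` is made up for illustration.

```python
# Figures as quoted in the thread (assumptions, not verified here).
R10K_DIE_MM2 = 298.0
P6_DIE_MM2 = 196.0

# R10K performance relative to the 200MHz Pentium Pro, per the quoted post.
R10K_SPEEDUP = {"specint95": 1.10, "specfp95": 1.50}


def perf_per_area_ratio(speedup, area_a, area_b):
    """Performance per unit area of chip A relative to chip B.

    speedup is A's performance divided by B's; a result below 1.0
    means B gets more performance out of each mm^2.
    """
    return speedup / (area_a / area_b)


area_ratio = R10K_DIE_MM2 / P6_DIE_MM2
ratios = {
    suite: perf_per_area_ratio(speedup, R10K_DIE_MM2, P6_DIE_MM2)
    for suite, speedup in R10K_SPEEDUP.items()
}

print(f"R10K die is {area_ratio:.2f}x the P6 die")
for suite, ratio in ratios.items():
    print(f"{suite}: R10K perf/area is {ratio:.2f}x that of P6")
```

On these numbers the R10K die is about 1.52x the size of the P6 die, which puts its integer performance per mm^2 at roughly 0.72x and its FP performance per mm^2 at roughly 0.99x that of the P6. Adding the L2 area to either side would shift both ratios.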