By: anon (anon.delete@this.anon.com), February 2, 2013 5:49 pm
Room: Moderated Discussions
Patrick Chase (patrickjchase.delete@this.gmail.com) on February 2, 2013 9:42 am wrote:
> anon (anon.delete@this.anon.com) on February 2, 2013 5:04 am wrote:
> > Patrick Chase (patrickjchase.delete@this.gmail.com) on February 1, 2013 10:11 pm wrote:
> > > David suggested posting this to the forum. I think he has a few remarks of his own to add on this topic...
> > >
> > > I think that the statement that x86 takes 5-15% more area than RISC is a bit simplistic,
> > > because the penalty is highly variable depending on what performance level you're
> > > targeting and what sort of microarchitecture you have to use to get there.
> > >
> > > As a simple example, x86 is utterly noncompetitive at the area/performance/power levels of, say, a Cortex
> > > M1/M3/M4 or even an R4. We saw the same thing in the mid 80s (compare MIPS R3K to 80386 at similar area,
> > > or R3K to 80486 at similar performance) and we see it now in the low end of the embedded market. x86 has
> > > traditionally been forced to rely on microcode and to forego micropipelining (as opposed to functional unit
> > > level pipelining a la 80386) to hit such low area targets, and that kills performance. RISCs win the area
> > > comparison by significant integer factors in that regime - It took Intel's million-transistor 80486 to catch
> > > up to an R3000 system (including caches and FPU) that totalled a few hundred thousand transistors.
> > >
> > > x86 starts to be marginally competitive once you get to dual-issue
> > > in-order superscalars (P5 vs. contemporaries
> > > in the late 80s; Atom vs. A8/A9 [*] today etc). The "x86 penalty" becomes fairly trivial once you get
> > > up to full-blown out-of-order Tomasulo machines and the like. We saw that with P6 vs. contemporaries,
> >
> > R10000 was introduced within a couple of months of PentiumPro.
> >
> > R10K was 4 way superscalar, 64-bit, OOOE, and excluding the larger L1
> > caches in the MIPS, the core was fewer transistors by the looks.
> >
> > On the sameish process (0.35) and date, 195MHz R10K was ~10% faster in
> > specint95 than the 200MHz PentiumPro, and ~50% faster in specfp95.
>
> Cache size has a huge impact on performance for many workloads. Architects make tradeoffs
> between core complexity and cache size all the time in order to optimize overall performance,
> and you therefore can't ignore caches when making comparisons. Area is area.
The R10K core minus caches was still fewer transistors, though. And it was 64-bit and 4-issue.
>
> The R10K die was 298 mm^2, P6 was 196 mm^2. R10K is 50% bigger, 10% faster for integer, and 50% faster for FP.
> If I take your claim of process equivalence at face value then that indicates that Intel had the performance-per-unit-area
> edge at that point
> (though I think you're wrong - Intel's design and process were better. If Intel had designed
> and fabbed the R10K it would have been significantly better than what MIPS came up with).
>
> Want to try again?
I'd rather you provide evidence for your claim that the x86 penalty for PPro vs. its contemporaries was trivial.
>
> Keep in mind also that most R10K installations used much larger external L2s than P6 (256K for initial
> P6 vs 256K-16MB for R10K) and that also counts against *total* area and *total* performance. Those
> big L2s helped a lot with specfp95, as that suite had notoriously small working sets...
The comparison I used had 1MB of L2 for the R10K. I believe the P6 L2 cache was faster and its memory latency was lower. You are quite right, though, that it's not an apples-to-apples comparison; making one seems difficult or impossible.
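
For what it's worth, the performance-per-area arithmetic being argued over can be written out. This is only a rough sketch using the figures quoted in this thread (298 mm^2 vs. 196 mm^2 die, ~10% and ~50% SPEC advantages for the R10K). It assumes equivalent processes, counts die area only (no external L2), and the names in it are made up for illustration.

```python
# Rough perf-per-area comparison from the numbers quoted in this thread.
# Assumptions: equivalent process, die area only (external L2 ignored).
R10K_AREA_MM2 = 298.0
P6_AREA_MM2 = 196.0

# R10K speed relative to the 200MHz PentiumPro (1.0 = equal), as quoted above.
R10K_RELATIVE_SPEED = {"specint95": 1.10, "specfp95": 1.50}


def perf_per_area_ratio(relative_speed, area, baseline_area):
    """Perf/area of a chip relative to the baseline chip (1.0 = parity)."""
    return relative_speed / (area / baseline_area)


area_ratio = R10K_AREA_MM2 / P6_AREA_MM2
ratios = {
    suite: perf_per_area_ratio(speed, R10K_AREA_MM2, P6_AREA_MM2)
    for suite, speed in R10K_RELATIVE_SPEED.items()
}

print(f"R10K die is {area_ratio:.2f}x the P6 die")
for suite, ratio in ratios.items():
    print(f"{suite}: R10K perf/area is {ratio:.2f}x that of P6")
```

On those inputs the R10K lands below parity on integer and roughly at parity on FP per unit of die area, but the result moves a lot depending on whether you count whole dies, cores minus caches, or external L2 as well.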