By: anon (anon.delete@this.anon.com), February 3, 2013 2:58 pm
Room: Moderated Discussions
Patrick Chase (patrickjchase.delete@this.gmail.com) on February 3, 2013 2:27 pm wrote:
> anon (anon.delete@this.anon.com) on February 2, 2013 5:49 pm wrote:
> > Patrick Chase (patrickjchase.delete@this.gmail.com) on February 2, 2013 9:42 am wrote:
> > > anon (anon.delete@this.anon.com) on February 2, 2013 5:04 am wrote:
> > > > On the same-ish process (0.35 um) and date, the 195 MHz R10K was ~10% faster in
> > > > SPECint95 than the 200 MHz Pentium Pro, and ~50% faster in SPECfp95.
> > >
> > > Cache size has a huge impact on performance for many workloads. Architects make tradeoffs
> > > between core complexity and cache size all the time in order to optimize overall performance,
> > > and you therefore can't ignore caches when making comparisons. Area is area.
> >
> > The core minus caches were fewer transistors for the R10K though.
>
> True, but irrelevant for the reason I gave above. All that matters is *total*
> performance and *total* area. The tradeoffs that each team made to get there
> are immaterial to the topic at hand (the net area penalty for x86).
>
> R10K was 307 mm^2 for core + L1s, PPro was 198 mm^2 for the same,
> both in 0.35 um. PPro was ~1/3 smaller, period, end of discussion.
I would not say that is the only thing that matters if you are making a statement like "the x86 penalty is trivial".
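For reference, here is a quick back-of-the-envelope check of those area figures (just a sketch in Python, using only the 307 mm^2 and 198 mm^2 numbers quoted above and nothing else):

# Area comparison using only the die sizes quoted above (0.35 um, core + L1s).
r10k_mm2 = 307.0
ppro_mm2 = 198.0
print("PPro smaller by %.0f%%" % (100 * (1 - ppro_mm2 / r10k_mm2)))  # ~36%, i.e. roughly a third smaller
print("R10K larger by %.0f%%" % (100 * (r10k_mm2 / ppro_mm2 - 1)))   # ~55%, i.e. >50% bigger

So the quoted figures do work out to roughly a third smaller for the PPro, or ~55% bigger for the R10K, depending on which way you take the ratio.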
>
> > And it was 64 bit, 4 issue.
>
> The R10K was indeed 64-bit, and that is a benefit that wouldn't have shown up in SPEC, so some credit needs
> to be given there. What do you think the area penalty for 64b was? (hint: Not much in this case, given
> that the R10K die was dominated by caches and many of the datapaths were already 64b for DP FP).
>
> 4-issue is again a detail that is irrelevant to the bottom-line performance/area tradeoff that is the topic
> of this discussion, though I would note that sheer issue/retire bandwidth is far from the entire picture in
> a modern OoO design. The PPro had a bigger OoO window (40 instructions vs 32) and could therefore both expose
> more simultaneous cache misses and tolerate more latency (for example misses to L2) without stalling.
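As a rough illustration of the window-vs-latency point (a sketch only; the sustained issue rate and miss latency below are assumptions made for the sake of the arithmetic, not figures from either data sheet):

# Sketch: window entries needed to keep issuing across a miss ~= sustained issue rate * miss latency.
issue_width = 3     # assumed sustained issue rate in instructions per cycle (assumption)
miss_latency = 40   # assumed latency in cycles for a miss that leaves the L1 (assumption)
needed = issue_width * miss_latency
for window in (32, 40):
    pct = 100.0 * window / needed
    print("%d-entry window covers ~%.0f%% of a %d-cycle miss" % (window, pct, miss_latency))

By that crude measure neither window hides a long miss on its own; the 40-entry window just covers ~25% more of it, and correspondingly lets a few more independent misses be exposed before the machine stalls.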
>
> Furthermore, the actual execution backends of both processors were of similar widths. The main difference was
> that the R10K had only 1 combined load/store unit and a second FPU, whereas PPro had independent load and store
> units but only 1 FPU (which was on the same port as one of the integer units). It's rare to have the "right" operation
> mix to fully utilize (or even 80% utilize) the execution resources of an OoO CPU, so I expect that the R10K's
> quad-issue capability was probably only of benefit to select (scientific/workstation?) workloads.
>
> > I'd rather that you provide evidence for your claim that x86 penalty or PPro vs contemporaries is trivial.
>
> It partly depends on what you call "trivial". I would say that David's estimate of a 5-15% area penalty for equivalent
> performance is trivial for all practical purposes, in that purchasing decisions will be determined by other
> factors (ecosystem etc) once you get that close. The P6 was where x86 first got into that range IMO. Plenty
> of contemporary RISCs were faster (R10K, Alpha 21164, PA8000) but those RISCs were also quite a bit bigger,
> and they were typically larger by a greater percentage than they were faster.
>
> With that said, you gave me plenty of evidence yourself. R10K was >50% bigger for a 10% advantage
> in integer and a 50% advantage in FP. Even crediting some of that size to 64-bitness, I'd call
> that evidence that the price/performance advantage of RISC was within the trivial range by that
> point. The market agreed: the P6 marked the end of the line for most RISC workstation/server
> players not named IBM (though a few staggered on for a while like zombies).
I don't think my evidence gives strong support for that. Adding a lot of cache was not unusual for low-volume server/workstation RISCs, and it does not necessarily give a performance increase that scales linearly with die size.
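For what it's worth, here are the raw performance-per-area ratios implied by the figures quoted in this thread (a sketch only, and it makes exactly the cache-included comparison I'm objecting to):

# Performance per area from the SPEC ratios and die sizes quoted above.
r10k_mm2, ppro_mm2 = 307.0, 198.0
area_ratio = r10k_mm2 / ppro_mm2     # ~1.55x more area for the R10K
specint_ratio = 1.10                 # R10K advantage in SPECint95, as quoted above
specfp_ratio = 1.50                  # R10K advantage in SPECfp95, as quoted above
print("SPECint95 per mm^2, R10K relative to PPro: %.2f" % (specint_ratio / area_ratio))  # ~0.71
print("SPECfp95  per mm^2, R10K relative to PPro: %.2f" % (specfp_ratio / area_ratio))   # ~0.97

On those raw numbers the PPro wins per mm^2 on integer and roughly ties on FP, but per my earlier point that the R10K core minus caches was the smaller of the two, much of the extra area is cache, so I don't think this settles how big the x86 decode/complexity penalty itself is.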
>
> ARM did just fine because they were playing down in that gate-count/performance regime where Intel
> couldn't compete with x86 (though the XScale was arguably the best ARM uarch of its generation...)
>
> > The comparison I used had 1MB of L2 for the R10K. I believe the P6 L2 cache was faster
> > and the memory latency was lower. Although you are quite right that it's
> > not an apples-to-apples comparison, which seems difficult or impossible to make.
>
> Agreed. The P6 L2 was unusually fast.