By: Patrick Chase (patrickjchase.delete@this.gmail.com), February 3, 2013 2:27 pm
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on February 2, 2013 5:49 pm wrote:
> Patrick Chase (patrickjchase.delete@this.gmail.com) on February 2, 2013 9:42 am wrote:
> > anon (anon.delete@this.anon.com) on February 2, 2013 5:04 am wrote:
> > > On the sameish process (0.35) and date, 195MHz R10K was ~10% faster in
> > > specint95 than the 200MHz PentiumPro, and ~50% faster in specfp95.
> >
> > Cache size has a huge impact on performance for many workloads. Architects make tradeoffs
> > between core complexity and cache size all the time in order to optimize overall performance,
> > and you therefore can't ignore caches when making comparisons. Area is area.
>
> The core minus caches were fewer transistors for the R10K though.
True, but irrelevant for the reason I gave above. All that matters is *total* performance and *total* area. The tradeoffs that each team made to get there are immaterial to the topic at hand (the net area penalty for x86).
R10K was 307 mm^2 for core + L1s, PPro was 198 mm^2 for the same, both in 0.35 um. PPro was ~1/3 smaller, period, end of discussion.
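For concreteness, here's the back-of-the-envelope arithmetic behind that "~1/3 smaller" figure, as a small Python sketch using only the die areas quoted above:

r10k_area_mm2 = 307.0  # R10K core + L1s, 0.35 um, as quoted above
ppro_area_mm2 = 198.0  # PPro core + L1s, 0.35 um, as quoted above

# Relative size reduction of the PPro vs. the R10K
shrink = (r10k_area_mm2 - ppro_area_mm2) / r10k_area_mm2
print(f"PPro is {shrink:.0%} smaller")  # ~35%, i.e. roughly 1/3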
> And it was 64 bit, 4 issue.
The R10K was indeed 64-bit, and that is a benefit that wouldn't have shown up in SPEC, so some credit needs to be given there. What do you think the area penalty for 64b was? (hint: Not much in this case, given that the R10K die was dominated by caches and many of the datapaths were already 64b for DP FP).
4-issue is again a detail that is irrelevant to the bottom-line performance/area tradeoff that is the topic of this discussion, though I would note that sheer issue/retire bandwidth is far from the entire picture in a modern OoO design. The PPro had a bigger OoO window (40 instructions vs 32) and could therefore both expose more simultaneous cache misses and tolerate more latency (for example misses to L2) without stalling.
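To make the window-size point concrete, here's a tiny illustrative Python sketch based on Little's law (instructions in flight = issue rate x latency covered). The 2.0 IPC figure is a made-up round number for illustration, not a measurement of either chip:

def cycles_hidden(window_entries, sustained_ipc):
    # Longest miss latency the window can cover before the machine stalls
    return window_entries / sustained_ipc

for name, window in (("PPro, 40-entry ROB", 40), ("R10K, 32-entry active list", 32)):
    print(f"{name}: ~{cycles_hidden(window, 2.0):.0f} cycles hidden at 2.0 IPC")
# The bigger window both keeps more loads in flight past a miss and rides
# out longer latencies (e.g. an L2 hit) before retirement backs up.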
Furthermore, the actual execution backends of both processors were of similar widths. The main difference was that the R10K had only a single combined load/store unit but a second FPU, whereas the PPro had independent load and store units but only one FPU (which shared a port with one of the integer units). It's rare to have the "right" operation mix to fully utilize (or even 80% utilize) the execution resources of an OoO CPU, so I expect that the R10K's quad-issue capability was probably only of benefit to select (scientific/workstation?) workloads.
> I'd rather that you provide evidence for your claim that the x86 penalty of the PPro vs contemporaries is trivial.
Part of it depends on what you call "trivial". I would say that David's estimate of a 5-15% area penalty for equivalent performance is trivial for all practical purposes, in that purchasing decisions will be determined by other factors (ecosystem etc.) once you get that close. The P6 was where x86 first got into that range IMO. Plenty of contemporary RISCs were faster (R10K, Alpha 21164, PA8000) but those RISCs were also quite a bit bigger, and they were typically larger by a greater percentage than they were faster.
With that said, you gave me plenty of evidence yourself. R10K was >50% bigger for a 10% advantage in integer and a 50% advantage in FP. Even crediting some of that size to 64-bitness, I'd call that evidence that the price/performance advantage of RISC was within the trivial range by that point. The market agreed - the P6 marked the end of the line for most RISC workstation/server players not named IBM (though a few staggered on for a while like zombies).
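To put rough numbers on "within the trivial range", here's a quick performance-per-area sketch in Python using the figures above (the ~10%/~50% SPEC advantages and the quoted die areas; approximate, not exact SPEC submissions):

ppro_area_mm2, r10k_area_mm2 = 198.0, 307.0  # core + L1s, 0.35 um

# Normalize PPro performance to 1.0; 1.10 and 1.50 are the ~10% int and
# ~50% FP advantages quoted above for the 195 MHz R10K vs the 200 MHz PPro.
for bench, r10k_speedup in (("SPECint95", 1.10), ("SPECfp95", 1.50)):
    ratio = (r10k_speedup / r10k_area_mm2) / (1.0 / ppro_area_mm2)
    print(f"{bench}: R10K delivers {ratio:.2f}x the PPro's performance per mm^2")
# -> ~0.71x for integer and ~0.97x for FP: bigger by a greater percentage
#    than it is faster in integer, and roughly breaking even in FP.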
ARM did just fine because they were playing down in a gate-count/performance regime where Intel couldn't compete with x86 (though the XScale was arguably the best ARM uarch of its generation...)
> Comparison used had 1MB of L2 for R10K. I believe the P6 L2 cache was faster
> and the memory latency was lower. Although you are quite right that it's
> not an apples to apples comparison, which seems difficult or impossible.
Agreed. The P6 L2 was unusually fast.