By: Patrick Chase (patrickjchase.delete@this.gmail.com), February 1, 2013 10:11 pm
Room: Moderated Discussions
David suggested posting this to the forum. I think he has a few remarks of his own to add on this topic...
I think that the statement that x86 takes 5-15% more area than RISC is a bit simplistic, because the penalty is highly variable depending on what performance level you're targeting and what sort of microarchitecture you have to use to get there.
As a simple example, x86 is utterly noncompetitive at the area/performance/power levels of, say, a Cortex M1/M3/M4 or even an R4. We saw the same thing in the mid-80s (compare the MIPS R3K to the 80386 at similar area, or the R3K to the 80486 at similar performance) and we see it now at the low end of the embedded market. x86 has traditionally been forced to rely on microcode and to forgo micropipelining (as opposed to functional-unit-level pipelining a la the 80386) to hit such low area targets, and that kills performance. RISCs win the area comparison by significant integer factors in that regime: it took Intel's million-transistor 80486 to catch up to an R3000 system (including caches and FPU) that totalled a few hundred thousand transistors.
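To put a rough number on "significant integer factors" - a back-of-the-envelope using the commonly cited ~1.2M transistor count for the 80486 and reading the "few hundred thousand" above as roughly 300K (both approximations, not figures from the post):

\[
\frac{N_{80486}}{N_{\mathrm{R3000\;system}}} \approx \frac{1.2 \times 10^{6}}{3 \times 10^{5}} \approx 4
\]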
x86 starts to be marginally competitive once you get to dual-issue in-order superscalars (P5 vs. its contemporaries in the early 90s; Atom vs. A8/A9 [*] today, etc.). The "x86 penalty" becomes fairly trivial once you get up to full-blown out-of-order Tomasulo machines and the like. We saw that with P6 vs. its contemporaries, and I expect we'll see it again shortly once Intel gets its Bonnell replacement out. That's *very* relevant to the discussion at hand, because the ARM microserver advocates seem to be taking the area/performance advantages seen in the A8/A9 regime and assuming they will hold when comparing the Cortex-A57 to its Intel competitors. Based on history I don't think that's a very smart bet.
My own take is that for ARM-based microservers to survive they need to stay down in the "many weak cores" regime and focus on massively parallel workloads that can tolerate the latency penalty. If they try to move up into higher performance brackets then they'll be playing directly into Intel's hand.
[*] Yes, I do realize that the A9 is OoO. Its capabilities in that regard are so limited that one wonders why they bothered, though. Embedded workloads are typically recompiled and optimized for each product and often use explicit prefetch to "expose" cache misses, and both of those tend to reduce the advantage of OoO.
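For readers unfamiliar with the explicit-prefetch idiom, here's a minimal sketch of the idea using GCC/Clang's __builtin_prefetch. The loop and the PREFETCH_AHEAD distance are hypothetical illustrations, not anything from the post; the point is that by requesting upcoming cache lines well before they're needed, the compiler/programmer hides much of the miss latency that an OoO window would otherwise have to cover.

#include <stddef.h>

#define PREFETCH_AHEAD 16  /* tuning parameter; an assumption for illustration */

long sum(const long *data, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; i++) {
        /* Issue a software prefetch for an element we'll touch ~16 iterations
         * from now; args: address, 0 = read, 1 = low temporal locality. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 1);
        acc += data[i];
    }
    return acc;
}

With the prefetch distance tuned to the memory latency, an in-order core spends little time stalled on these loads, which is exactly why this technique erodes the payoff of a small OoO window like the A9's.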