Article: AMD's Mobile Strategy
By: David Kanter (dkanter.delete@this.realworldtech.com), December 20, 2011 7:21 pm
Room: Moderated Discussions
Exophase (exophase@gmail.com) on 12/20/11 wrote:
>This is basically how I see the whole situation of x86 instructions on modern Intel
>CPUs vs ARM instructions on modern ARM cores (note that AMD is doing very different things):
>
>1) uop fusion allows for exactly two cases: load + op (either ALU ops or indirect branches) and store + AGU.
>a) ARM can perform load + branch in one cycle, but IIRC ARM64 can't. It can't perform load + ALU.
>b) Most RISC cores can perform stores as one operation, with varying levels of
>addressing complexity (MIPS is quite awful, SPARC is a little better, ARM is a lot
>better). See my earlier comments on this.
You missed one other trick that x86 has, which is the stack engine. Based on the UT paper it has a small but noticeable impact. IIRC, the paper cited about 4% on average, but it was higher for the more complex integer benchmarks.
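For anyone unfamiliar with the trick: the stack engine tracks the cumulative ESP/RSP adjustment in the front end, so PUSH/POP don't need a separate ALU uop for the pointer update. A toy Python model of the uop savings (my own simplification for illustration, not Intel's actual bookkeeping):

```python
def uops_for_sequence(ops, stack_engine=True):
    """Count uops for a list of ops like 'push', 'pop', 'add'.
    With a stack engine, PUSH/POP is just the store/load uop; without
    one, each also needs an explicit ESP-update uop."""
    uops = 0
    for op in ops:
        if op in ("push", "pop"):
            uops += 1 if stack_engine else 2
        else:
            uops += 1
    return uops

seq = ["push", "push", "add", "pop", "pop"]
print(uops_for_sequence(seq, stack_engine=True))   # 5
print(uops_for_sequence(seq, stack_engine=False))  # 9
```

The ~4% figure above is consistent with stack-heavy call/return code seeing the largest benefit.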
>2) unfused x86 uops have four strengths that I can think of over ARM/ARM64 ops:
>32-bit immediates, 8/16-bit operations, branch fusion, and whatever SIMD operations
>x86 supports that NEON doesn't. ARM operations have way more advantages (which I
>listed previously). Some code will like the large immediates and/or sub-32-bit operations
>enough to make a difference, although the latter is mitigated in ARM64 with sign-extending
>operations. ARMv6+ also has packed 8/16-bit operations, although these are removed in ARM64.
How is LEA handled in most designs? Is it a single uop?
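For context on why LEA matters here: it evaluates the full x86 addressing expression (base + index*scale + displacement) without touching memory, which is why compilers lean on it for cheap arithmetic. A sketch of the computation it performs:

```python
def lea(base, index=0, scale=1, disp=0):
    """Model of x86 LEA: effective-address arithmetic, no memory access."""
    assert scale in (1, 2, 4, 8)  # the only scale factors x86 encodes
    return base + index * scale + disp

# e.g. 'lea eax, [rdi + rdi*4]' computes x*5 with no multiply:
x = 7
print(lea(x, index=x, scale=4))  # 35
```

Because it's pure address arithmetic, it at least plausibly maps to one AGU or ALU uop in most designs, though three-component forms may be slower on some.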
>3) The relevant decode metric for x86 (on Intel processors anyway) is fused uops
>much more than it is x86 instructions. That's because whenever possible, and probably
>much more so than not, SB will fetch fused uops from its uop cache. Ignoring that,
>the decoders are 4:1:1:1 so it's difficult to realize a decode advantage from multi-fused
>uop instructions even if you're running straight from the decoders. Therefore it's
>hard to call RMW operations for instance a single-unit decode (but those still have
>the advantage of being two decode operations vs three on ARM). Even then, the ratio
>isn't much different from 1.0, although you should probably ignore the effect branch
>fusion (not uop fusion) has on bringing this number down.
>
>So from these points it doesn't look like the x86 ISA allows much more per fused
>uop than an ARM instruction does. Load + op is the main one, but we have actual
>numbers on the benefits for it. We don't have numbers for the other advantages,
>and we don't have the numbers for any of ARM's advantages.
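On the 4:1:1:1 point above: a minimal model of that restriction (my simplification; it ignores the uop cache, the loop buffer, and per-group uop caps) shows why a multi-uop instruction that isn't first in a decode group costs an extra cycle:

```python
def cycles_to_decode(uop_counts):
    """uop_counts: uops per instruction, in program order.
    4:1:1:1 decoders: only the first slot of each decode group may take
    a multi-uop instruction; otherwise the group ends early."""
    cycles, i = 0, 0
    while i < len(uop_counts):
        cycles += 1
        slot = 0
        while i < len(uop_counts) and slot < 4:
            if uop_counts[i] > 1 and slot != 0:
                break  # multi-uop insn must lead a new group
            i += 1
            slot += 1
    return cycles

print(cycles_to_decode([2, 1, 1, 1]))  # 1 cycle: complex insn leads
print(cycles_to_decode([1, 2, 1, 1]))  # 2 cycles: group splits at the RMW
```

This is why schedule-sensitive code cares where RMW-style instructions land relative to group boundaries.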
It's also not clear how many of those advantages will exist for ARMv8/ARM64 (e.g. shifting).
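To make the shifting point concrete: ARMv7's flexible second operand folds a shift into most ALU ops, which x86 can only partially mimic via LEA's limited scale factors. A sketch of what a single ARM instruction computes:

```python
# One ARM instruction, e.g. 'add r0, r1, r2, lsl #3', computes:
def add_shifted(a, b, shift):
    # x86 needs a separate shift, or LEA when shift <= 3 (scale 2/4/8)
    return a + (b << shift)

print(add_shifted(10, 3, 3))  # 34
```

Whether ARM64 keeps enough of this (it restricts the shifted-operand forms) to preserve the instruction-count advantage is exactly the open question.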
>In an attempt to bring this back to the point of the conversation: in the power
>domain of < say, 1W per core, the only x86 processor that really fits the bill is
>Atom. I don't have hard numbers but based on figures companies have given of Cortex-A15
>perf/W vs Cortex-A9 on TSMC 40nm vs 28nm (i.e., at the very least, it's higher) and
>that it's showing up in SoCs meant for smartphones, I expect it to fit in < 1W per
>core too, or at least when not operating at 2+GHz. Atom has 8 fetch bytes/cycle
>and 2 decoders, while Cortex-A15 has 16 fetch bytes and 3 decoders. In this power
>envelope it's quite possible that ISA is limiting the frontend throughput. Granted,
>Atom's two-issue in-order design wouldn't benefit from much more (maybe more fetch),
>but Bobcat is out-of-order and has 16-byte fetch and still two decoders.
I don't think it's reasonable to compare "Atom cores today" with "A15 cores that don't yet exist".
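Still, the raw fetch numbers can be turned into a rough ceiling: fetch bytes per cycle divided by average instruction length bounds front-end throughput. The ~3.5-byte average x86 instruction length below is an assumption for illustration; ARM instructions are a fixed 4 bytes:

```python
def fetch_bound(fetch_bytes, avg_insn_bytes):
    """Upper bound on instructions fetched per cycle."""
    return fetch_bytes / avg_insn_bytes

print(round(fetch_bound(8, 3.5), 2))   # Atom, 8B fetch: ~2.29 insns/cycle
print(round(fetch_bound(16, 4.0), 2))  # Cortex-A15, 16B fetch: 4.0 insns/cycle
```

So even before decode, Atom's fetch width roughly matches its 2-wide decode, while A15's comfortably feeds 3 decoders.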
>It'll get more interesting later if we do see 4-way ARM64 cores that are targeting
>whatever power budget per core Intel's next-generation Atoms are.
Yeah, I think it will also help to see a new uarch from Intel. That may give a better idea of what's feasible, because I think Intel has avoided uarch changes to focus on system-level issues.
David