Article: AMD's Mobile Strategy
By: Exophase (exophase.delete@this.gmail.com), December 20, 2011 10:24 am
Room: Moderated Discussions
Linus Torvalds (torvalds@linux-foundation.org) on 12/16/11 wrote:
---------------------------
>Why do you confuse "cycles" and "instructions". They have
>nothing to do with each other.
When I said "cycles" I really meant "stages in the execution unit" which was a poor way of saying "uops" ;p
>
>Of your list, this is indeed the only one that is "unusual",
>in that store does that special "address" and "data" uop.
>
>That said, the others are not all that out-of-line for a
>RISC setup, and even the "two uops for store" is mitigated
>to some degree that the x86 addressing modes for the address
>op often *are* equivalent to another RISC instruction.
The thing is, we're not talking about "RISC" here, we're talking specifically about ARM. ARM can do all of those things as one instruction, and its loads and stores have very competent addressing. x86 can do an extra immediate add and has larger immediates, but most loads and stores are of the variety array[imm] or array[var], both of which ARM can handle (and for the latter it can scale by higher powers of two than x86 can). I would contend that where loads and stores are concerned, post-increment (*ptr++) is much more common than the more complex addressing modes involving huge offsets or a register plus immediate offset. Or at the very least I use the former all the time in C and ASM code and the latter pretty rarely, with the one exception being struct->array[var]. A rough sketch of this in C is below.
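To make the addressing-mode comparison concrete, here's a minimal C sketch. The function names are mine and purely illustrative, and the assembly in the comments is just the obvious hand translation of each access, not verified compiler output:

    /* Post-increment walk: one ARM (A32) instruction per load,
       two x86 instructions (the load, then a pointer add). */
    int sum_words(const int *p, int n)
    {
        int sum = 0;
        while (n-- > 0)
            sum += *p++;   /* ARM: ldr r3, [r0], #4
                              x86: mov eax, [rdi] / add rdi, 4 */
        return sum;
    }

    /* Scaled index (array[var]): one instruction on both ISAs,
       but ARM can shift the index by 0-31 while x86 scales only
       by 1, 2, 4, or 8. */
    int index_word(const int *base, int i)
    {
        return base[i];    /* ARM: ldr r0, [r0, r1, lsl #2]
                              x86: mov eax, [rdi + rsi*4] */
    }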
>
>>At the very least show the courtesy to cite SOMETHING when
>>bringing up figures like your "1.2 to 1.7" one.
>
>Like you've cited stuff?
What arguments have I made that you feel need citation? I just want to make it clear: I never thought you were making things up, just that your 1.2 to 1.7 uop-to-instruction ratio was ambiguous because you didn't say what processor it was for.
>Anyway, here is the source
>http://tams-www.informatik.uni-hamburg.de/lehre/2001ss/proseminar/mikroprozessoren/papers/pentium-pro-performance.pdf
>
>which is horribly formatted, but shows that 1.2 - 1.7 number
>(average: 1.35). That's the PPro.
Thank you, that's fairly interesting if extremely outdated information. I still say that P6 uops and ARM instructions are a world apart.
>Here is, for comparison,
>a rather interesting POWER comparison with a more modern
>Intel CPU (Woodcrest), which shows very close to 1.0:
>
>http://lca.ece.utexas.edu/pubs/spec09_ciji.pdf
>
>which is also interesting because their pathlengths are
>actually very comparable with POWER. It is possible (in
>fact likely) that at least part of that is simply
>differences in compilers too, of course.
Yes, now this is much more interesting, although I would not be willing to accept POWER and ARM as anything like equivalent either (sorry, I know that's probably getting old ;p).
This is basically how I see the whole situation of x86 instructions on modern Intel CPUs vs ARM instructions on modern ARM cores (note that AMD is doing very different things):
1) uop fusion allows for exactly two cases: load + op (either ALU ops or indirect branches) and store + AGU (see the sketch after this list).
a) ARM can perform load + branch as one instruction (a load into the PC), but IIRC ARM64 can't. Neither can perform load + ALU.
b) Most RISC cores can perform stores as one operation, with varying levels of addressing complexity (MIPS is quite awful, SPARC is a little better, ARM is a lot better). See my earlier comments on this.
2) unfused x86 uops have four strengths that I can think of over ARM/ARM64 ops: 32-bit immediates, 8/16-bit operations, branch fusion, and whatever SIMD operations x86 supports that NEON doesn't. ARM operations have way more advantages (which I listed previously). Some code will like the large immediates and/or sub-32-bit operations enough to make a difference, although the latter is mitigated in ARM64 by sign-extending operations. ARMv6+ also has packed 8/16-bit operations, although these are removed in ARM64.
3) The relevant decode metric for x86 (on Intel processors, anyway) is fused uops much more than x86 instructions. That's because whenever possible, and probably more often than not, Sandy Bridge will fetch fused uops from its uop cache. Even ignoring that, the decoders are arranged 4:1:1:1, so it's difficult to realize a decode advantage from instructions that produce multiple fused uops even when running straight from the decoders. Therefore it's hard to count an RMW operation, for instance, as a single decode unit (though it still has the advantage of being two decode operations vs three instructions on ARM). Even then, the ratio isn't much different from 1.0, although you should probably ignore the effect branch fusion (not uop fusion) has on bringing this number down.
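To make points 1 and 3 concrete, here's a small sketch. The assembly in the comments is the obvious hand translation, and the uop breakdown is the commonly cited one for Sandy Bridge, so treat it as illustrative rather than measured:

    /* A read-modify-write of memory, viewed from both ISAs. */
    void bump(int *counter)
    {
        /* x86 RMW form:
         *   add dword ptr [rdi], 1
         * One instruction, but it cracks into load + add (fused)
         * plus store-address + store-data (fused): two fused uops,
         * which need the complex slot of the 4:1:1:1 decoders.
         *
         * ARM form:
         *   ldr r1, [r0]
         *   add r1, r1, #1
         *   str r1, [r0]
         * Three instructions, three decode slots, one op each. */
        *counter += 1;
    }

So the RMW still buys x86 something at decode (two slots vs three), just not the full 3:1 collapse the single instruction might suggest.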
So from these points it doesn't look like the x86 ISA gets much into a fused uop that an ARM instruction doesn't. Load + op is the main one, and at least we have actual numbers on its benefit. We don't have numbers for the other advantages, and we don't have numbers for any of ARM's advantages.
In an attempt to bring this back to the point of the conversation: in the power domain of, say, < 1W per core, the only x86 processor that really fits the bill is Atom. I don't have hard numbers, but based on the Cortex-A15 vs Cortex-A9 perf/W figures companies have given (on TSMC 28nm vs 40nm respectively; i.e., at the very least it's higher), and the fact that it's showing up in SoCs meant for smartphones, I expect Cortex-A15 to fit in < 1W per core too, at least when not operating at 2+GHz. Atom fetches 8 bytes/cycle and has 2 decoders, while Cortex-A15 fetches 16 bytes/cycle and has 3 decoders. In this power envelope it's quite possible that the ISA is limiting frontend throughput (a rough back-of-the-envelope follows below). Granted, Atom's two-issue in-order design wouldn't benefit from much more (maybe more fetch), but Bobcat is out-of-order and still has 16-byte fetch and only two decoders.
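As a rough back-of-the-envelope on those frontend numbers (the ~3.5 byte average x86 instruction length is my assumption, a commonly quoted ballpark for integer code, not a measured figure):

    #include <stdio.h>

    int main(void)
    {
        /* Assumed average instruction lengths. ARM is fixed at 4
           bytes; the x86 figure is a ballpark, not a measurement. */
        const double x86_avg_len = 3.5;
        const double arm_len     = 4.0;

        /* Fetch-limited instructions/cycle next to decoder width. */
        printf("Atom:       ~%.1f insn/cycle fetched, 2 decoders\n",
               8.0 / x86_avg_len);    /* ~2.3 */
        printf("Bobcat:     ~%.1f insn/cycle fetched, 2 decoders\n",
               16.0 / x86_avg_len);   /* ~4.6 */
        printf("Cortex-A15:  %.1f insn/cycle fetched, 3 decoders\n",
               16.0 / arm_len);       /* 4.0 */
        return 0;
    }

By this crude measure Atom's 8-byte fetch only just covers its two decoders when instructions run long, while A15's fixed-length fetch comfortably feeds three.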
It'll get more interesting later if we do see wide 4-way ARM64 cores targeting whatever power budget per core Intel's next-generation Atoms do.