Article: AMD's Mobile Strategy
By: Exophase (exophase.delete@this.gmail.com), December 15, 2011 10:44 am
Room: Moderated Discussions
Linus Torvalds (torvalds@linux-foundation.org) on 12/15/11 wrote:
---------------------------
>That's a total red herring.
>
>x86 instructions do more.
>
>Doing a two-way x86 decode is not rocket science, and has
>been done for a long time. And it's likely not all that
>different from four-way ARM that is not even done yet.
There is no way in hell that two x86 instructions do the work of four ARM64 instructions in any workload approaching normal. Have you read the ARM64 spec?
x86-64 instructions can do load + op and load + op + store in one instruction. Their memory addressing has reg + reg * scale + imm (large or small), where ARM64 gives you either a medium imm or a scaled register. x86-64 can also use full arbitrary 32-bit constants.
But on the flip side, ARM64 operations are three-address, can fold shifts and sign extensions, and have more conditional ops (moves, selects, increments, negates, compares vs just moves). They also have pre- and post-increment loads/stores, which are actually pretty commonly used, and load/store pairs. And compare/test + branch if non-zero. And a variety of multiply + add/subtract/negate instructions.
Then there's SIMD where people generally agree that NEON (particularly as it exists in ARM64 spec) is more useful than SSE/AVX, although it depends on the task.
And ARM64 has 31 GPRs vs x86-64's 15 (in both cases the stack pointer is a separately named register), so there will be less need to fold loads and perform RMWs.
As far as constants are concerned, ARM64 changes the format again, and the jury is still out on how it will compare vs ARM's traditional 8-bit + 4-bit ROR format. It has a different encoding for add/sub vs logic, and the latter will be able to represent useful bit patterns over a full 64 bits, in some cases exceeding what x86-64 can (generally no 64-bit immediates).
So yes, x86-64 will often issue instructions that take multiple ARM64 instructions. And ARM64 will often save instructions over x86-64. Without real studies on real code it's pointless to say which has how much advantage, but you're looking at x86's advantages in complete isolation, which is unfair, and making any claims about 2x work-per-instruction is beyond absurd.
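To make the immediate comparison concrete, here is a rough Python sketch of the three encodability rules as I understand them: classic ARM's 8-bit value rotated right by an even amount, ARM64's logical immediates (a rotated run of ones inside a 2/4/8/16/32/64-bit element that is replicated across the register), and x86-64's usual sign-extended 32-bit ALU immediate. The function names are made up for illustration, and this checks only whether a value is representable; it does not produce actual instruction encodings.

```python
MASK64 = (1 << 64) - 1


def rotr(value, amount, width):
    """Rotate a width-bit value right by amount bits."""
    amount %= width
    mask = (1 << width) - 1
    return ((value >> amount) | (value << (width - amount))) & mask


def is_arm32_rotated_imm(value):
    """Classic ARM: an 8-bit constant rotated right by an even amount (0..30)."""
    value &= 0xFFFFFFFF
    # Trying every even rotation of the value covers every possible encoding.
    return any(rotr(value, rot, 32) <= 0xFF for rot in range(0, 32, 2))


def is_arm64_logical_imm(value):
    """ARM64 logical immediate: a replicated element holding a rotated run of ones."""
    value &= MASK64
    if value == 0 or value == MASK64:
        return False  # all-zeros and all-ones are not representable
    for size in (2, 4, 8, 16, 32, 64):
        elem_mask = (1 << size) - 1
        elem = value & elem_mask
        # The element must tile the whole 64-bit value.
        replicated = 0
        for shift in range(0, 64, size):
            replicated |= elem << shift
        if replicated != value or elem in (0, elem_mask):
            continue
        # Some rotation of the element must look like 0...01...1.
        for rot in range(size):
            r = rotr(elem, rot, size)
            if r & (r + 1) == 0:
                return True
    return False


def fits_x86_64_simm32(value):
    """x86-64 ALU ops: a 32-bit immediate sign-extended to 64 bits."""
    value &= MASK64
    signed = value - (1 << 64) if value >> 63 else value
    return -(1 << 31) <= signed < (1 << 31)
```

For example, 0x0000FFFFFFFF0000 passes is_arm64_logical_imm but fails fits_x86_64_simm32, which is the kind of 64-bit mask case mentioned above. Going the other way, an arbitrary constant like 0x12345678 fits the x86-64 check but is not an ARM64 logical immediate.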
>Those x86 addressing modes are powerful and used. And they
>regularly replace two or more ARM instructions. Just
>do the math: ARM code isn't actually all that much denser
>even in Thumb, yet x86 instructions are rather longer on
>average.
In fact, even x86-64 instructions are [i]on average[/i] less than 4 bytes each thanks to all of the 1-2 byte opcodes. But it's not worth saying much about how the two compare without citing concrete comparisons.
>They are very different. An ARM instruction is closer to
>the old-style uops (and by "old-style" I mean the ones that
>Intel used to produce that didn't have read-modify-write
>versions: the uops in Core 2+ are rather closer to the
>real x86 instructions).
>
>The "uops per instructions" on x86 (again, older x86)
>tended to be in the 1.2-1.7 range on spec, according to some
>papers (again, they are now much closer to 1:1, but that's
>because the modern Core2+ uops are actually more powerful
>than ARM instructions are).
>
If you're looking at P6 and Netburst uops, they are substantially less powerful than an ARM64 instruction.
Let's look at Netburst's wonderfully powerful uops:
8/16bit loads are 2 uops
cmovs are 3 uops
push and pop are 2 uops
scaled reg lea is 2-3 uops
bswap is 3 uops
prefetch is 4 uops
inc/dec is 2 uops (!!)
multiplies and divides are 4 uops
shifts and rotates by CL are 2 uops
rotates through carry are 4 uops
bit test is 2-3 uops
bit scan is 2 uops
setcc is 3 uops
indirect branch is 3 uops
near call is 3 uops
indirect call is 4 uops
ret is 4 uops
ARM64 can do all of those things in one instruction. So take your 1.2-1.7 number and reduce it to not count all of these penalties. Now include all of the things that allow ARM64 to save instructions over x86-64. Better yet, get back to us when you have real empirical data (vs an ISA that hasn't been used for anything yet, mind you) instead of "everybody knows" rants based on your feelings.
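As a back-of-the-envelope illustration of how those multi-uop cases feed into a uops-per-instruction figure, here is a small Python sketch. The uop counts come from the Netburst list above; the instruction mix is entirely hypothetical (made up to show the arithmetic, not measured from any workload), and everything not in the table is assumed to be a single uop.

```python
# uop counts taken from the Netburst list above
NETBURST_UOPS = {
    "push_pop": 2,
    "inc_dec": 2,
    "cmov": 3,
    "setcc": 3,
    "near_call": 3,
    "ret": 4,
}

# HYPOTHETICAL dynamic mix per 100 instructions, for illustration only
HYPOTHETICAL_MIX = {
    "simple_alu": 70,  # assumed to be 1 uop each
    "push_pop": 10,
    "inc_dec": 8,
    "cmov": 4,
    "setcc": 2,
    "near_call": 3,
    "ret": 3,
}


def average_uops(mix, uop_table, default_uops=1):
    """Weighted average uops per instruction for a given instruction mix."""
    total_instructions = sum(mix.values())
    total_uops = sum(
        count * uop_table.get(name, default_uops) for name, count in mix.items()
    )
    return total_uops / total_instructions


def average_uops_if_single(mix, uop_table, single_uop_names, default_uops=1):
    """Same average, but treating the named instruction kinds as 1 uop each."""
    adjusted = {k: v for k, v in uop_table.items() if k not in single_uop_names}
    return average_uops(mix, adjusted, default_uops)
```

With this made-up mix, average_uops(HYPOTHETICAL_MIX, NETBURST_UOPS) lands inside the quoted 1.2-1.7 range purely because of instructions that each have a one-instruction ARM64 equivalent, and average_uops_if_single with all of those names drops it back to 1.0. Real numbers need a real measured mix.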