Article: AMD's Mobile Strategy
By: Wilco (Wilco.Dijkstra.delete@this.ntlworld.com), December 21, 2011 5:34 pm
Room: Moderated Discussions
Seni (seniike@hotmail.com) on 12/21/11 wrote:
---------------------------
>Exophase (exophase@gmail.com) on 12/21/11 wrote:
>
>>Generally only imm32 (or imm8) is available. The only x86-64 instructions available
>>with 64-bit displacements are absolute loads or stores to al/ax/eax/rax. Generally
AFAIK the only instruction with a 64-bit immediate that x64 compilers can generate is mov r64, imm64 (for example mov rax, imm64).
>I'll have to double-check that. If true, it's a big letdown.
No, it's not. Using 64-bit immediates in the instruction stream is insane. It's not just that 64-bit immediates are rarely required; they are wasteful in every sense, especially if you want to use them for global variable accesses.
>>>The x86 version combines not only the AGU op and Load, but also up to 1 ALU op,
>>>and the loading and adding in of a full-length immediate.
>>
>>I actually think that store immediate is one of the more useful instructions that x86 has over ARM.
>
>Strange. It might be common but I doubt it has much impact, as its performance
>would be barely different from the 2-instruction equivalent.
That is true for all load/store + ALU combination ops. They never give you better performance.
>>>So for example, the x86-64 instruction
>>>ADD RAX, [RBX + RSI + imm64]
>
>>Well yeah, if this x86 instruction existed.
>
>I really should check these things.
>The 32-bit version exists though, and it would have a 4-instruction ARM equivalent.
>If you need a separate MOV RAX, imm64 then the x86 version takes 2 instructions
>to do the work of 6, which is still pretty compact.
You're looking at it from the wrong perspective. Yes, x86 can do this in a single instruction:
ADD RAX, [RBX + RSI + imm32]
But what advantage does this have? It's not equivalent to x += a[i][j] in C, so you still need extra instructions to do some useful work. Once you take those into account, you end up needing a similar number of instructions or cycles to do the same task.
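To make that concrete, here is a rough C sketch of the difference. The names add_from_mem, add_element and COLS are made up for illustration, and the row length of 8 is an arbitrary assumption. add_from_mem models only what the single instruction does (base + unscaled index + constant displacement); add_element shows the index arithmetic that x += a[i][j] needs before that instruction can be used.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { COLS = 8 };  /* hypothetical row length, chosen only for the example */

/* Models ADD RAX, [RBX + RSI + imm32]:
   x = RAX, base = RBX, index_bytes = RSI (unscaled), disp = the constant. */
static int64_t add_from_mem(int64_t x, const unsigned char *base,
                            size_t index_bytes, size_t disp)
{
    int64_t v;
    memcpy(&v, base + index_bytes + disp, sizeof v);  /* the load */
    return x + v;                                     /* the ALU op */
}

/* x += a[i][j] for an array of int64_t with COLS columns.
   The multiply and the scaling are the extra work the single
   instruction above does not do for you. */
static int64_t add_element(int64_t x, const void *a, size_t i, size_t j)
{
    size_t index_bytes = (i * COLS + j) * sizeof(int64_t);
    return add_from_mem(x, (const unsigned char *)a, index_bytes, 0);
}
```

x86 can fold a scale of 1, 2, 4 or 8 on the index register into the address, which would cover the j * 8 part here, but the i * COLS part still has to be computed separately.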
>>LDM/STM was the only big multi-op instruction. ARM64 removes it but instead has
>>load/store pair which is a decent compromise for saving instructions for register
>>save/restore. This is also consistent with ARM's last few uarch decisions, where
>>ldm/stm had 2x the peak bandwidth to L1 compared to ldr/str.. they probably want
>>to still provide for this sort of direct utilization.
>
>Ok, I took a look at LDM/STM and LDP/STP.
>They seem more like a vector op than a series of separate memory accesses. You're
>loading or storing a large contiguous block from a single address. So, on second
>thought, I can't really consider it a big multi-op instruction at all, since the number of operations going on is one.
Multi-op in the sense that it is a set of loads and stores combined into a single instruction. Just like ldrd is a combination of 2 loads to consecutive addresses. I think it's a shame to have lost push/pop as it was a huge code size win. You could literally define them as a sequence of double-load/store instructions, decode them early into uops and the rest of the OoO engine would never have to worry about them.
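As a rough sketch of that idea, the expansion could look something like the C below. This is not how any real decoder is specified: PairUop and expand_push are hypothetical names, and the code assumes 32-bit registers (4 bytes each, so 8 bytes per pair) with the lowest-numbered register stored at the lowest address, as LDM/STM do.

```c
#include <stddef.h>
#include <stdint.h>

/* One double-store uop (hypothetical format). */
typedef struct {
    int reg_lo;  /* register stored at offset */
    int reg_hi;  /* register stored at offset + 4, or -1 for a single store */
    int offset;  /* byte offset from the new stack pointer */
} PairUop;

/* Expand a push register mask (bit r set = register r is pushed)
   into pair uops. out must have room for 8 entries.
   Returns the number of uops written. */
static size_t expand_push(uint16_t mask, PairUop out[8])
{
    size_t n = 0;
    int offset = 0;
    int pending = -1;  /* first register of a pair waiting for its partner */

    for (int r = 0; r < 16; r++) {
        if (!(mask & (1u << r)))
            continue;
        if (pending < 0) {
            pending = r;
            continue;
        }
        out[n++] = (PairUop){ pending, r, offset };
        offset += 8;
        pending = -1;
    }
    if (pending >= 0)  /* odd register count: one single store left over */
        out[n++] = (PairUop){ pending, -1, offset };
    return n;
}
```

With this, something like push {r4, r5, r6, lr} becomes two pair uops, and everything after decode only ever sees double stores.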
Wilco