Article: AMD's Mobile Strategy
By: Wilco (Wilco.Dijkstra.delete@this.ntlworld.com), December 22, 2011 6:42 am
Room: Moderated Discussions
Seni (seniike@hotmail.com) on 12/21/11 wrote:
---------------------------
>Wilco (Wilco.Dijkstra@ntlworld.com) on 12/21/11 wrote:
>---------------------------
>>>>>The x86 version combines not only the AGU op and Load, but also up to 1 ALU op,
>>>>>and the loading and adding in of a full-length immediate.
>>>>
>>>>I actually think that store immediate is one of the more useful instructions that x86 has over ARM.
>>>
>>>Strange. It might be common but I doubt it has much impact, as its performance
>>>would be barely different from the 2-instruction equivalent.
>>
>>That is true for all load/store + alu combination ops. It never gives you better performance.
>
>It saves a little bit of fetch bandwidth. It was a pretty big deal on a 286.
>It's pretty much a wash if you have caches though, but there are still some minor
>effects having to do with decode and with register file ports and so on.
Indeed. But if you overuse load+op rather than caching the value in a register, you end up bottlenecking the memory unit. Load+op+store appears useful only when you need to do a single operation, and even then you have to be very careful about the decode rules.
>>You're looking at it from the wrong perspective. Yes, x86 can do this in a single instruction:
>>
>>ADD RAX, [RBX + RSI + imm32]
>>
>>But what advantage does this have? It's not equivalent to x += a[i][j] in C, so
>>you still need extra instructions to do some useful work. Once you take those into
>>account, you end up needing a similar number of instructions or cycles to do the same task.
>>
>
>x+= ptr[ix].member; //array of structures
>or
>x+= *ptr.member[ix]; //structure of arrays
>
>It's not a perfect fit for 2D arrays, but it is a good fit for accessing an element
>of a structure in an array that is in the stack or on the heap, or was passed into a function by address.
>You have a pointer to the base of the array in RBX which is not known until runtime.
>And you have an index that changes each iteration in RSI. And you have an offset
>into the structure (the imm). A structure's member offset probably doesn't need
>to be a 32-bit imm, but that's one of the options x86 gives us, and if you did use
>structure of arrays, the offsets could get pretty big.
So the ARM equivalent of these would be something like this for AoS:

    add r1, ptr, ix
    ldr r0, [r1, #small_imm]
    add x, x, r0

or for SoA:

    add r1, ptr, #big_imm   @ could be 2 instructions if big_imm is not an encodable immediate
    ldr r0, [r1, ix]
    add x, x, r0

In both cases the first instruction would be loop-invariant and lifted out of the loop. So in terms of executed instructions the more complex addressing mode doesn't help much. Load+op does help in this case, as the loaded value is never reused.
>But it's not just a matter of one specific use case. Any memory address is gonna
>be constructed as a sum of zero or more registers + zero or more immediates, and
>if there's more than one imm, they can be combined at compile-time. So the eight
>simplest possible memory addressing modes, in order of increasing complexity are:
>
>[0] //this seems kinda silly, it's only here for completeness
>[imm] //this seems awkward to me. It's very inflexible.
>[register]
That's Itanium, which only offers [register] - a bad idea even with post-increment update.
>[register+imm] //this is very heavily used and is the standard on most RISCs.
>[register+register]
Also commonly supported on RISCs.
>[register+register+imm]
>[register+register+register] //this requires a lot of register file ports.
>[register+register+register+imm]
>
>There are diminishing returns. Most RISCs draw the line after [reg+imm].
or [reg+reg].
>x86 draws it at right after [reg+reg+imm], so x86 is catching a little more of the tail of the distribution.
>
>>Multi-op in the sense that it is a set of loads and stores combined into a single
>>instruction. Just like ldrd is a combination of 2 loads to consecutive addresses.
>
>An SSE MOV loads from four consecutive addresses of floats at a time but it's still one load.
>From the cache's point of view, the TLB's point of view, etc. consecutive addresses = one op.
That's true, but don't those SSE loads write 1 register and execute in a single cycle? I don't think you can split them into 4 individual loads, can you? An ldm loads multiple registers over several cycles (typically 2 registers per cycle since ARM11), so it is not exactly like an SSE load.
Wilco