Article: AMD's Mobile Strategy
By: Seni (seniike.delete@this.hotmail.com), December 21, 2011 5:53 pm
Room: Moderated Discussions
Wilco (Wilco.Dijkstra@ntlworld.com) on 12/21/11 wrote:
---------------------------
>>>>The x86 version combines not only the AGU op and Load, but also up to 1 ALU op,
>>>>and the loading and adding in of a full-length immediate.
>>>
>>>I actually think that store immediate is one of the more useful instructions that x86 has over ARM.
>>
>>Strange. It might be common but I doubt it has much impact, as its performance
>>would be barely different from the 2-instruction equivalent.
>
>That is true for all load/store + alu combination ops. It never gives you better performance.
It saves a little bit of fetch bandwidth. That was a pretty big deal on a 286. It's pretty much a wash once you have caches, though there are still some minor effects involving decode, register file ports, and so on.
>You're looking at it from the wrong perspective. Yes, x86 can do this in a single instruction:
>
>ADD RAX, [RBX + RSI + imm32]
>
>But what advantage does this have? It's not equivalent to x += a[i][j] in C, so
>you still need extra instructions to do some useful work. Once you take those into
>account, you end up needing a similar number of instructions or cycles to do the same task.
>
x += ptr[ix].member; //array of structures
or
x += ptr->member[ix]; //structure of arrays
It's not a perfect fit for 2D arrays, but it is a good fit for accessing a member of a structure in an array, whether that array is on the stack, on the heap, or was passed into a function by address.
You have a pointer to the base of the array in RBX which is not known until runtime. And you have an index that changes each iteration in RSI. And you have an offset into the structure (the imm). A structure's member offset probably doesn't need to be a 32-bit imm, but that's one of the options x86 gives us, and if you did use structure of arrays, the offsets could get pretty big.
But it's not just a matter of one specific use case. Any memory address is going to be constructed as a sum of zero or more registers plus zero or more immediates, and if there's more than one imm, they can be combined at compile time. So the eight simplest possible memory addressing modes, in order of increasing complexity, are:
[0] //this seems kinda silly, it's only here for completeness
[imm] //this seems awkward to me. It's very inflexible.
[register]
[register+imm] //this is very heavily used and is the standard on most RISCs.
[register+register]
[register+register+imm]
[register+register+register] //this requires a lot of register file ports.
[register+register+register+imm]
There are diminishing returns. Most RISCs draw the line after [reg+imm].
x86 draws it right after [reg+reg+imm], so x86 catches a little more of the tail of the distribution.
>Multi-op in the sense that it is a set of loads and stores combined into a single
>instruction. Just like ldrd is a combination of 2 loads to consecutive addresses.
A 128-bit SSE MOV loads four consecutive floats at a time, but it's still one load.
From the cache's point of view, the TLB's point of view, etc., consecutive addresses = one op.
---------------------------