Article: AMD's Mobile Strategy
By: Brett (ggtgp.delete@this.yahoo.com), December 22, 2011 11:59 pm
Room: Moderated Discussions
Exophase (exophase@gmail.com) on 12/22/11 wrote:
---------------------------
>Wilco (Wilco.Dijkstra@ntlworld.com) on 12/21/11 wrote:
>---------------------------
>>In terms of padding overhead there is no difference between largest or smallest
>>first. In any case for structures this won't matter on ARM64 as the smallest offset
>>is a whopping 4KB! On Thumb-1 however it matters as the maximum offset for 8/16/32-bit
>>loads is just 31/62/124. So smallest first has been best for a while.
>>
>>Wilco
>>
>
>Correct, the overhead is the same. But by having all the padding at the end instead
>of distributed within the structure it's easier to track the offsets manually (if
>you're making defines for ASM or something) and the spatial locality is a little better for cache usage.
>
>I've never programmed for Thumb-1 and god willing never will. The closest I hope
>to come is converting it to ARM code that sucks a lot less. On the other hand, ldrd/strd
>only allow +/- 8bit immediate offsets in both ARM and Thumb-2 encodings so it'd make more sense to put them first.
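To make the padding point above concrete, here is a minimal C++ sketch. The struct and field names are made up for illustration, and the offsets in the comments assume a typical ABI where each fixed-width integer is aligned to its own size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical structs: the same four fields in opposite orders.
struct SmallestFirst {
    std::uint8_t  a;  // offset 0, followed by 1 byte of internal padding
    std::uint16_t b;  // offset 2
    std::uint32_t c;  // offset 4
    std::uint64_t d;  // offset 8
};

struct LargestFirst {
    std::uint64_t d;  // offset 0
    std::uint32_t c;  // offset 8
    std::uint16_t b;  // offset 12
    std::uint8_t  a;  // offset 14, followed by 1 byte of tail padding
};

// The total padding overhead is the same either way.
static_assert(sizeof(SmallestFirst) == sizeof(LargestFirst),
              "both orderings should pad to the same size");

// Checks where the padding lands: inside the struct for smallest-first,
// at the very end for largest-first.
inline void check_padding_placement() {
    assert(offsetof(SmallestFirst, b) == 2);
    assert(offsetof(LargestFirst, a) == 14);
}
```

With largest-first every field is packed back to back and the only padding is at the tail, which is what makes hand-written offset defines easier to keep straight.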
If you have separate update/render/other loops, you can get a good performance boost by grouping fields by which code uses them and splitting the groups on cache-line boundaries.
So the structure becomes: update vars, shared + pad, other vars, shared + pad, render vars.
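A rough sketch of that layout in C++, assuming a 64-byte cache line. The Entity struct, its field names, and the update function are all hypothetical; the point is only that each usage group starts on its own line and the shared variables sit in the space that would otherwise be padding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines. Common, but check your target.
constexpr std::size_t kCacheLine = 64;

// Hypothetical entity, with fields grouped by which loop touches them.
struct Entity {
    // Line 0: update-only vars, then shared vars filling part of the gap.
    alignas(kCacheLine) float velocity[3];
    float mass;
    float position[3];        // shared
    std::uint32_t flags;      // shared

    // Line 1: vars used by other code, then another shared var.
    alignas(kCacheLine) std::uint32_t sound_id;
    std::uint32_t script_id;
    float bounds_radius;      // shared

    // Line 2: render-only vars.
    alignas(kCacheLine) std::uint32_t mesh_id;
    std::uint32_t material_id;
    float tint[4];
};

static_assert(offsetof(Entity, sound_id) == kCacheLine,
              "other vars should start on the second cache line");
static_assert(offsetof(Entity, mesh_id) == 2 * kCacheLine,
              "render vars should start on the third cache line");
static_assert(sizeof(Entity) == 3 * kCacheLine,
              "the compiler inserts the padding between groups");

// The update loop body only reads and writes cache line 0.
inline void update(Entity& e, float dt) {
    // The layout only helps if the object itself starts on a line boundary.
    assert(reinterpret_cast<std::uintptr_t>(&e) % kCacheLine == 0);
    for (int i = 0; i < 3; ++i) {
        e.position[i] += e.velocity[i] * dt;
    }
}
```

Letting alignas generate the padding means the groups stay on their boundaries when fields are added later, at the cost of the struct growing in whole cache lines.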
Be aware of class types that add a hidden member to the start of your structures (in C++, typically the vtable pointer of a class with virtual functions); check your alignment in the debugger.
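If you would rather check this in code than in the debugger, something like the following sketch works. The two struct names are made up, and where the hidden pointer goes is up to the compiler's ABI rather than the language; the common compilers put it at offset 0.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical plain struct: the first field sits at offset 0.
struct PlainBody {
    float x;
    float y;
};

// Same fields, but the virtual destructor makes the compiler add a
// hidden vtable pointer, which common ABIs place before the first field.
struct VirtualBody {
    virtual ~VirtualBody() = default;
    float x;
    float y;
};

// Byte offset of member x inside obj, measured at run time
// (offsetof is not guaranteed to work on types with virtual functions).
template <typename T>
std::ptrdiff_t offset_of_x(const T& obj) {
    return reinterpret_cast<const char*>(&obj.x) -
           reinterpret_cast<const char*>(&obj);
}

// Sanity check that the plain struct really starts with its first field.
inline void check_plain_layout() {
    PlainBody p{};
    assert(offset_of_x(p) == 0);
}
```

On a typical 64-bit build the hidden pointer shifts every field by 8 bytes, which is enough to push a carefully placed group across a cache-line boundary.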
Malloc typically gives you 16-byte alignment, while cache lines are often 64 bytes. With large (256+ byte) structures you can play the odds: 16-byte alignment between usage types is fine. You will have some shared variables anyway that can fill the gaps between usage types.
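If you do not want to play the odds, one option is to over-allocate and round the pointer up to a cache-line boundary. This is a minimal sketch assuming 64-byte lines; AlignedBlock and alloc_cache_aligned are hypothetical names, and only std::malloc, std::free, and std::align are real library calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// True if p is a multiple of alignment (which must be a power of two).
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical helper type: keep the raw pointer so it can be freed later.
struct AlignedBlock {
    void* raw;      // what malloc returned; pass this to std::free
    void* aligned;  // first cache-line-aligned address inside the block
};

// Over-allocate by one line minus a byte, then round up with std::align.
inline AlignedBlock alloc_cache_aligned(std::size_t size) {
    std::size_t space = size + kCacheLine - 1;
    void* raw = std::malloc(space);
    void* cursor = raw;
    void* aligned =
        raw ? std::align(kCacheLine, size, cursor, space) : nullptr;
    return AlignedBlock{raw, aligned};
}
```

Usage is to place the structure at block.aligned and call std::free(block.raw) when done. This wastes up to 63 bytes per allocation, so it makes more sense for pools or arrays of large structures than for many small objects.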
Be warned that this will make the structure look ugly/funny to you at first, but the ~20% performance boost will make you smile.
"Almost all programming can be viewed as an exercise in caching"