Article: AMD's Mobile Strategy
By: Exophase (exophase.delete@this.gmail.com), December 21, 2011 8:02 am
Room: Moderated Discussions
Michael S (already5chosen@yahoo.com) on 12/21/11 wrote:
---------------------------
>I'd guess that at least half of those are ADDs with left shift by [1..3]. I.e., no better than x86 LEA.
I'm going to do some more involved profiling tonight. It might be better to profile something newer (these games were all old), because GCC was a lot worse with ARM back then.
>People use pre and post-adjust when available. When unavailable, people find other
>ways to achieve the same objectives, typically with very small overhead. That's
>where x86 3-component addressing shines.
Post-adjust is usually used when iterating through arrays. You can use reg + (loop_counter * element_size) to avoid incrementing a pointer, but then the loop counter has to be incremented instead, which adds a compare later. reg + (reg * scale) + imm isn't a huge advantage here when that immediate offset can be hoisted out of the loop.
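Here's a minimal C sketch of the two iteration styles I mean (the function and variable names are just mine for illustration): with post-adjust the pointer itself carries the loop state, while with base + index*scale the counter has to be bumped and compared on its own.

#include <stddef.h>

/* Pointer-bump style: maps naturally to ARM post-adjust, e.g. ldr rX, [p], #4 */
int sum_post_adjust(const int *p, size_t n)
{
    int s = 0;
    const int *end = p + n;
    while (p != end)
        s += *p++;
    return s;
}

/* Indexed style: maps to x86 mov eax, [base + i*4]; i still needs its own inc/cmp */
int sum_indexed(const int *p, size_t n)
{
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}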
>
>Also take into account that higher-end OoO ARM cores would have to either crack
>GPR loads with update option (as high-end Power/PPC) or issue them simultaneously
>through a couple of execution ports. So, you wouldn't see much of energy saving.
Sure, but we're talking about decode bandwidth and this is pretty analogous to what you get out of fused uops on current desktop Intel.
>On the other hand, when memcpy is long then x86 rep movs semantics provide more
>opportunities for hw acceleration. The evidence is the fantastic speed (16B/clock,
>equal to peak capabilities of D$) these instructions demonstrate on Nehalem/SandyBridge.
>IIRC, Cortex-A9 achieves 2 or 4B/clock despite its L1D hardware being capable of 8B/clock.
Are you talking about bandwidth to L1, L2, main memory, or what? Cortex-A9 definitely isn't limited to 2 or 4 bytes per cycle to L1; you don't have a problem getting full bandwidth with ldm/stm. When they aren't going to the cache, these instructions generate appropriately sized bursts on the bus, so there's no reason why they wouldn't provide similar acceleration. I don't see how having burst lengths in the dozens of bytes instead of hundreds or thousands is going to make a big difference.
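For the copy case, this is roughly the kind of inner loop I have in mind, written as GCC-style inline asm for ARMv7 (the function name and the assumption of an aligned, multiple-of-16-bytes length are my own simplifications): ldm/stm with writeback moves 16 bytes per instruction pair and turns into full-width bursts on the bus.

#include <stddef.h>
#include <stdint.h>

/* Aligned bulk copy, 16 bytes per iteration via ldm/stm with writeback. */
void copy_blocks(uint32_t *dst, const uint32_t *src, size_t bytes)
{
    for (size_t i = 0; i < bytes; i += 16) {
        __asm__ volatile(
            "ldmia %0!, {r4-r7}\n\t"   /* load 4 words, src advances by 16 */
            "stmia %1!, {r4-r7}\n\t"   /* store 4 words, dst advances by 16 */
            : "+r"(src), "+r"(dst)
            :
            : "r4", "r5", "r6", "r7", "memory");
    }
}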
Of course on ARM64 you just get load/store pairs, but you also still have instructions that can load/store up to 4 16-byte NEON vectors.
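On the ARM64 side, something like this sketch shows what I mean by the 4-vector forms, using the ACLE vld1q_u8_x4/vst1q_u8_x4 intrinsics (which newer compilers map to ld1/st1 of four q registers); the function name and the 64-byte-multiple length are my own assumptions.

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* 64 bytes per iteration: one 4-register NEON load plus one 4-register store. */
void copy64(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i += 64) {
        uint8x16x4_t v = vld1q_u8_x4(src + i);  /* ld1 {vN.16b-vN+3.16b}, [addr] */
        vst1q_u8_x4(dst + i, v);                /* st1 {vN.16b-vN+3.16b}, [addr] */
    }
}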