Article: AMD's Mobile Strategy
By: Michael S (already5chosen.delete@this.yahoo.com), December 21, 2011 10:36 am
Room: Moderated Discussions
Exophase (exophase@gmail.com) on 12/21/11 wrote:
---------------------------
>Michael S (already5chosen@yahoo.com) on 12/21/11 wrote:
>---------------------------
>>
>>On the other hand, when memcpy is long then x86 rep movs semantics provide more
>>opportunities for hw acceleration. The evidence is the fantastic speed (16B/clock,
>>equal to the peak capability of the D$) these instructions demonstrate on Nehalem/SandyBridge.
>>IIRC, Cortex-A9 achieves 2 or 4B/clock despite its L1D hardware being capable of 8B/clock.
>
>Are you talking about bandwidth to L1, L2, main memory, or what?
L1
>Cortex-A9 definitely
>isn't limited to 2 or 4 bytes per cycle to L1, you don't have a problem getting
>full bandwidth with ldm/stm.
2 or 4B/clock average for the whole memcpy, not for individual ldm/stm.
Or are you saying that Cortex-A9 could memcpy, say, 1000 dword-aligned L1-resident bytes in fewer than 250 (or 500, I don't remember) clocks without resorting to NEON?
>When not going to cache, these instructions generate
>appropriately sized bursts on the bus, so there's no reason why they can't
>provide similar acceleration. I don't see how having burst lengths in the dozens
>of bytes instead of hundreds or thousands is going to make a big difference.
>
Don't ask me about the reasons. I designed neither Nehalem nor Cortex-A9. Probably something to do with startup/shutdown overhead and with the ability to hide the latency.
>Of course on ARM64 you just get load/store pairs, but you also still have instructions
>that can load/store up to 4 16-byte NEON vectors.
>