Article: AMD's Mobile Strategy
By: Wilco (Wilco.Dijkstra.delete@this.ntlworld.com), December 21, 2011 6:06 pm
Room: Moderated Discussions
Michael S (already5chosen@yahoo.com) on 12/21/11 wrote:
---------------------------
>Exophase (exophase@gmail.com) on 12/21/11 wrote:
>---------------------------
>>Michael S (already5chosen@yahoo.com) on 12/21/11 wrote:
>>---------------------------
>>>
>>>On the other hand, when memcpy is long then x86 rep movs semantics provide more
>>>opportunities for hw acceleration. The evidence is the fantastic speed (16B/clock,
>>>equal to peak capabilities of D$) these instructions demonstrate on Nehalem/SandyBridge.
>>>IIRC, Cortex-A9 achieves 2 or 4B/clock despite its L1D hardware being capable of 8B/clock.
You meant "4B/clk copy because its L1D is capable of max 8B/clk load/store".
There is no inherent advantage of rep movsb/w/d/q over ldm/stm. Nehalem/SB is faster only because it has a wider L1 interface and 2 ports, so given this I'd expect the A9 to be exactly 4 times slower (4B/clk vs 16B/clk). The A15 has a 128-bit wide Neon engine, and thus most likely a 128-bit L1 interface with 2 ports, so it could theoretically beat SB/Nehalem, since ldm/stm do not have the startup or latency overheads that rep movs has.
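To make the arithmetic explicit, here is a rough sketch of the model I have in mind. It is only an upper bound: it assumes a copy is limited purely by L1 width and by whether a load and a store can go in the same clock, and it ignores startup, alignment and loop overhead. The function name and the port flag are just for illustration.

```python
def peak_copy_bytes_per_clock(l1_width_bytes, separate_load_store_ports):
    """Upper bound on L1-to-L1 copy rate in bytes per clock (simplified model)."""
    if separate_load_store_ports:
        # A load and a store can be done in the same clock.
        return l1_width_bytes
    # Loads and stores share one port, so every copied byte costs two accesses.
    return l1_width_bytes / 2


# Assumed figures from the discussion: A9 has a 64-bit interface doing a load
# or a store per clock; Nehalem/SB has a 128-bit interface doing both per clock.
a9 = peak_copy_bytes_per_clock(8, False)
sb = peak_copy_bytes_per_clock(16, True)
ratio = sb / a9

print(a9, sb, ratio)
```

Under these assumptions the A9 tops out at 4B/clk and Nehalem/SB at 16B/clk, which is where the factor of 4 comes from.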
>>Are you talking about bandwidth to L1, L2, main memory, or what?
>
>L1
>
>>Cortex-A9 definitely
>>isn't limited to 2 or 4 bytes per cycle to L1, you don't have a problem getting
>>full bandwidth with ldm/stm.
>
>2 or 4B/clock average for the whole memcpy, not for individual stm/ldm.
>Or are you saying that Cortex-A9 could memcpy, say, 1000 dword-aligned L1-resident
>bytes faster than in 250 (or 500, I don't remember) clocks without resorting to Neon?
Neon has a 64-bit interface too on both the A8 and A9, so it has no bandwidth advantage. I haven't benchmarked it on the A9, but depending on size and alignment Neon was ~10-15% faster in L1 to L1 copies on the A8. However, Neon is not great at copying the final few bytes, and on the A8 you get nasty stalls when you access data written by Neon from the integer side.
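For the 1000-byte case asked about above, the clock counts follow directly from the average rate. A quick sketch, assuming a steady rate for the whole copy with no startup or tail cost (the helper name is made up for illustration):

```python
def copy_clocks(num_bytes, bytes_per_clock):
    """Clocks needed to copy num_bytes at a steady average rate."""
    return num_bytes / bytes_per_clock


# Rates mentioned in the thread: 2 or 4 B/clk for the A9, 16 B/clk for Nehalem/SB.
clocks = {rate: copy_clocks(1000, rate) for rate in (2, 4, 16)}

for rate, n in clocks.items():
    print(f"{rate:2d} B/clk -> {n} clocks")
```

So 4B/clk gives the 250 clocks and 2B/clk the 500 clocks mentioned in the question. A real memcpy will be somewhat slower than this because of setup and the final few bytes.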
Wilco