Article: AMD's Mobile Strategy
By: anon (anon.delete@this.anon.com), December 21, 2011 8:15 pm
Room: Moderated Discussions
Wilco (Wilco.Dijkstra@ntlworld.com) on 12/21/11 wrote:
---------------------------
>Michael S (already5chosen@yahoo.com) on 12/21/11 wrote:
>---------------------------
>>Exophase (exophase@gmail.com) on 12/21/11 wrote:
>>---------------------------
>>>Michael S (already5chosen@yahoo.com) on 12/21/11 wrote:
>>>---------------------------
>>>>
>>>>On the other hand, when memcpy is long then x86 rep movs semantics provide more
>>>>opportunities for hw acceleration. The evidence is the fantastic speed (16B/clock,
>>>>equal to the peak capabilities of the D$) these instructions demonstrate on Nehalem/SandyBridge.
>>>>IIRC, Cortex-A9 achieves 2 or 4B/clock despite its L1D hardware being capable of 8B/clock.
>
>You meant "4B/clk copy because its L1D is capable of max 8B/clk load/store".
>
>There is no inherent advantage of rep movsb/w/d/q over ldm/stm. Nehalem/SB is faster
>only because it has a wider L1 interface and 2 ports, so given this I'd expect the
>A9 to be exactly 4 times slower. The A15 has a 128-bit wide Neon engine, and thus
>most likely a 128-bit L1 interface with 2 ports, so could theoretically beat SB/Nehalem
>as ldm/stm does not have any startup or latency overheads unlike rep movs.
"fast string operations" are horrible for small ops, unfortunately, so they're unsuitable for generic mem/string operations in C. I hope they will get better with newer architectures.
Basic memory copy and set are CISC instructions I can really get behind, because the hardware should know much more about caches, prefetchers, etc. than the software does.
I suspect, though, that they have a relatively larger advantage on x86, which already has complex decoding and can take advantage of weak store ordering within the operation, so other ISAs probably wouldn't bother.
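To make the small-op problem concrete, here is a minimal C sketch of the kind of size dispatch a library memcpy ends up doing if it wants to use the string instruction at all. The function name, threshold name, and threshold value are made up for illustration; the real crossover is microarchitecture-dependent and has to be measured.

#include <stddef.h>

/* Hypothetical crossover point; the real value depends on the
   microarchitecture and would have to be measured. */
#define REP_MOVS_THRESHOLD 128

static void *copy_dispatch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n < REP_MOVS_THRESHOLD) {
        /* Small copies: a plain loop avoids the fixed startup cost
           of the string instruction. */
        while (n--)
            *d++ = *s++;
    } else {
        /* Large copies: let the hardware drive the whole transfer.
           x86-64 GCC/Clang inline asm; rep movsb copies RCX bytes
           from [RSI] to [RDI]. */
        __asm__ volatile ("rep movsb"
                          : "+D" (d), "+S" (s), "+c" (n)
                          :
                          : "memory");
    }
    return dst;
}

The byte loop and the threshold are placeholders (a real library adds word and vector paths), but the point stands: the dispatch and the startup cost are pure overhead for the tiny, unknown-size copies that generic C code generates all the time.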
>
>>>Are you talking about bandwidth to L1, L2, main memory, or what?
>>
>>L1
>>
>>>Cortex-A9 definitely
>>>isn't limited to 2 or 4 bytes per cycle to L1; you don't have a problem getting
>>>full bandwidth with ldm/stm.
>>
>>2 or 4B/clock average for the whole memcpy, not for individual stm/ldm.
>>Or are you saying that Cortex-A9 could memcpy, say, 1000 dword-aligned L1-resident
>>bytes faster than in 250 (or 500, I don't remember) clocks without resorting to Neon?
>
>Neon has a 64-bit interface too on both the A8 and A9, so it isn't any faster. I haven't
>benchmarked it on the A9, but depending on the size and alignment Neon was ~10-15% faster
>in L1-to-L1 copies on the A8. However, Neon is not too great at doing the final few
>bytes, and on the A8 you got nasty stalls when accessing data written by Neon from the integer side.
>
>Wilco
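On the Neon tail-bytes point above: for anyone curious, here is a minimal C sketch with NEON intrinsics of what such a copy loop looks like. The function name is mine and the routine is deliberately untuned; on the A8/A9 the 64-bit NEON interface means the 128-bit ops below get split internally anyway, consistent with Wilco's numbers. The bulk goes through NEON and the leftover bytes through the integer side, and the A8 cross-pipeline stall he mentions shows up when integer code then reads data that NEON just stored.

#include <stddef.h>
#include <arm_neon.h>

/* Illustrative NEON copy: 16-byte vector loads/stores for the bulk,
   plain integer byte loop for the tail. Not a tuned routine. */
static void *neon_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= 16) {
        uint8x16_t v = vld1q_u8(s);   /* NEON load  */
        vst1q_u8(d, v);               /* NEON store */
        d += 16;
        s += 16;
        n -= 16;
    }

    /* The awkward part: the last few bytes go through the integer
       pipeline, and on Cortex-A8 later integer-side reads of the
       NEON-written destination can hit a cross-pipeline stall. */
    while (n--)
        *d++ = *s++;

    return dst;
}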