Article: AMD's Mobile Strategy
By: Exophase (exophase.delete@this.gmail.com), December 21, 2011 10:15 am
Room: Moderated Discussions
Michael S (already5chosen@yahoo.com) on 12/21/11 wrote:
---------------------------
>2 or 4B/clock average for the whole memcpy, not for individual stm/ldm.
>Or are you saying that Cortex-A9 could memcpy, say, 1000 dword-aligned L1-resident
>bytes faster than in 250 (or 500, I don't remember) clocks without resorting to Neon?
Sorry, I missed the point about this being about copies instead of loads and stores.
No ARM CPU currently out has simultaneous load and store access to L1 like x86 CPUs have had for years (barring Atom, of course). Cortex-A15 has it, so we'll have to see what it's capable of for ldm/stm throughput. My guess is that the loads and stores get turned into individual 64- or 128-bit accesses (whatever the L1 width is) and have no real problem pairing with whatever loop control there is. It might be a bad idea to use especially large ldm/stm if it fills the issue queues.
But I doubt saturating L1 dcache bandwidth will be a big problem. All it boils down to is whether or not it can access the whole width in one transaction, and either ldm/stm or the 64-bit loads/stores will allow this. You don't need a rep movs instruction.
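To make the point concrete, here's a minimal C sketch of what such a copy loop looks like (the function name `copy64` and the fixed 8-byte chunking are my own illustration, not anything from a real libc): each iteration issues one 64-bit load and one 64-bit store, which a compiler targeting ARM is free to lower to ldrd/strd or ldm/stm pairs. No rep-movs-style primitive is involved.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: copy len bytes (assumed to be a multiple of 8)
 * in 64-bit chunks. The memcpy calls on a fixed 8-byte size are the
 * standard portable idiom for an unaligned-safe 64-bit load/store;
 * compilers turn them into single doubleword accesses, not calls. */
static void copy64(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < len; i += 8) {
        uint64_t chunk;
        memcpy(&chunk, s + i, 8);   /* one 64-bit load  */
        memcpy(d + i, &chunk, 8);   /* one 64-bit store */
    }
}
```

On a core with one load port and one store port (as described for Cortex-A15 above), the load and store of each iteration can pair, so the loop can approach one full-width transaction per direction per cycle in L1.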