Article: AMD's Mobile Strategy
By: S. Rao (sonny.rao.delete@this.gmail.com), December 21, 2011 9:12 pm
Room: Moderated Discussions
Wilco (Wilco.Dijkstra@ntlworld.com) on 12/21/11 wrote:
---------------------------
>However 5 instructions, ie. 20 bytes to make a 8-byte immediate is rather silly.
>The best latency is 3 cycles. A load would not only have similar latency (Cortex-A9
>has 2 cycle load-latency for forwarded cases) but allow other independent operations
>to be executed as well (due to not using up all the fetch/decode/execute resources).
Can you elaborate on this? I've had a difficult time finding
info about the Cortex-A9 core pipeline.
(BTW, why is it so difficult to find such basic information?)
Everything I've seen says the L1 latency is 4 cycles.
For example see the GCC machine description:
http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01858.html
"+;; Loads have a latency of 4 cycles."
Running lat_mem_rd from lmbench also corroborates the 4 cycles.
The next best info I've found is this:
http://arm.com/files/downloads/Cortex-A9_Devcon_2007_Microarchitecture.pdf
It has some nice diagrams and info about the branch predictor,
but nothing I've seen says there's a 2-cycle load latency for forwarded cases.
Are you omitting the address generation part or something?
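
For anyone who wants to repeat the measurement, here is a minimal sketch of the pointer-chasing idea behind a lat_mem_rd-style test. It is not lmbench's code: the names build_chain, chase and ns_per_load are made up for illustration, and the buffer size and stride you pass in are assumptions you would pick so that the chain stays inside the L1 data cache.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <time.h>

/* Link slots spaced `stride` bytes apart into one ring:
 * slot i points at slot i+1, and the last slot points back at slot 0.
 * Assumes buf is pointer-aligned and stride is a multiple of sizeof(void *). */
static void **build_chain(char *buf, size_t size, size_t stride)
{
    size_t n = size / stride;
    for (size_t i = 0; i < n; i++)
        *(void **)(buf + i * stride) = buf + ((i + 1) % n) * stride;
    return (void **)buf;
}

/* Follow the ring for `steps` loads. Each load's address comes from the
 * previous load's result, so the loads cannot overlap and the loop time
 * is dominated by load-to-use latency. */
static void **chase(void **p, size_t steps)
{
    while (steps--)
        p = (void **)*p;
    return p;
}

/* Average wall-clock nanoseconds per dependent load. */
static double ns_per_load(void **start, size_t steps)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void **end = chase(start, steps);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile void *sink = end;   /* keep the result live */
    (void)sink;
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}
```

Multiplying the result by the core clock in GHz gives cycles per load. Note that this measures the full dependent load-to-load time, address generation included, so by itself it cannot tell a 2-cycle forwarded case apart from a 4-cycle total.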