Article: AMD's Mobile Strategy
By: Wilco (Wilco.Dijkstra.delete@this.ntlworld.com), December 22, 2011 4:59 am
Room: Moderated Discussions
S. Rao (sonny.rao@gmail.com) on 12/21/11 wrote:
---------------------------
>Wilco (Wilco.Dijkstra@ntlworld.com) on 12/21/11 wrote:
>---------------------------
>
>>However 5 instructions, ie. 20 bytes to make a 8-byte immediate is rather silly.
>>The best latency is 3 cycles. A load would not only have similar latency (Cortex-A9
>>has 2 cycle load-latency for forwarded cases) but allow other independent operations
>>to be executed as well (due to not using up all the fetch/decode/execute resources).
>
>Can you elaborate on this? I've had a difficult time finding
>info about Cortex A9 core pipeline.
>(BTW, why is it so difficult to find such basic information?)
>Everything I've seen says the L1 latency is 4 cycles.
Check out page 200 of the TRM: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388g/DDI0388G_cortex_a9_r3p0_trm.pdf
The typical definition of "result latency" is the number of cycles you would stall if you used the loaded result immediately. So with a 2-cycle result latency for the forwarded case, the total latency would be 3 cycles.
>For example see the GCC machine description:
>http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01858.html
>
>"+;; Loads have a latency of 4 cycles."
>
>running lat_mem_rd from lmbench also corroborates the 4 cycles
That would match the non-forwarded case. The initial design had a special forwarding path for load-load cases, but that may have been too difficult to keep (or of too little use outside benchmarks).
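For reference, lat_mem_rd measures latency by pointer chasing: every load produces the address of the next load, so the measured time per step is the load-to-load latency. A minimal sketch of that technique in C follows. The names build_chain, chase and ns_per_load are hypothetical helpers for illustration, not lmbench's code, and clock_gettime assumes a POSIX system.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Keeps the result of the chase observable so the loop is not optimized away. */
static void *volatile sink;

/* Link n slots into a cycle: slot i holds the address of slot (i + stride) % n.
   Hypothetical helper; when stride and n are coprime the cycle visits every slot. */
void build_chain(void **slots, size_t n, size_t stride)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        slots[i] = (void *)&slots[(i + stride) % n];
}

/* Perform `steps` dependent loads: each load's result is the next load's address. */
void **chase(void **start, size_t steps)
{
    void **p = start;
    while (steps--)
        p = (void **)*p;
    return p;
}

/* Average wall-clock nanoseconds per dependent load over `steps` loads. */
double ns_per_load(void **start, size_t steps)
{
    struct timespec t0, t1;
    assert(steps > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void **end = chase(start, steps);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = (void *)end;
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}
```

To use it, keep the chain small enough to stay in L1, call build_chain once, then call ns_per_load with a large step count (millions) and multiply the result by the core clock in GHz to get cycles per load. Since the chain is purely load-to-load, it would only show the shorter latency if the core forwards a loaded value straight into the next load's address generation.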
>The next best info I've found is this:
>http://arm.com/files/downloads/Cortex-A9_Devcon_2007_Microarchitecture.pdf
>>
>
>Which has some nice diagrams and info about the branch predictor,
>but nothing I've seen says there's a 2 cycle load-latency for forwarded cases.
>Are you omitting the address generation part or
>something?
No, it's just a different definition of latency. Page 124 shows the cache lookup takes just 1 cycle, with 1 cycle for the AGU and 1 cycle for tag comparison and fast forwarding, i.e. the minimum latency is 3 cycles.
Wilco