Article: AMD's Mobile Strategy
By: Michael S (already5chosen.delete@this.yahoo.com), December 17, 2011 1:12 pm
Room: Moderated Discussions
Wilco (Wilco.Dijkstra@ntlworld.com) on 12/17/11 wrote:
---------------------------
>Michael S (already5chosen@yahoo.com) on 12/17/11 wrote:
>---------------------------
>>Exophase (exophase@gmail.com) on 12/16/11 wrote:
>>---------------------------
>>>
>>>store: 2 uops (this is seriously the friggin silver bullet against your "uops =
>>>RISC" argument, please tell me the RISC uarch that takes two cycles for stores)
>>
>>Ignoring your stupid "2 uops=2 clocks" equivalence...
>>Power4 "cracks" all integer stores that use reg+reg addressing. That is roughly
>>equivalent to Intel's generation of 2 uOps, although less helpful for OoO scheduling.
>>I didn't read sufficiently detailed docs on Power5/6/7 to know whether they crack
>>stores in the same way. My personal guess: Power5 and Power7 do, Power6 doesn't.
>>
>>ARM Cortex-A9 uArch is rather poorly documented, even relative to Power, but
>>it would surprise me if they don't "crack" integer stores with reg+reg addressing
>>or don't issue them simultaneously through 2 issue ports, which is almost the same thing.
>
>No, [reg, +reg] and [reg, +reg lsl #2] take a single AGU cycle like [reg, imm]
>addressing modes. More complex addressing modes do the shift in the integer ALU,
>then forward to the AGU (at least that is what the timings suggest).
>
>See http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388e/CIAECBEB.html
>
>The A9 never cracks instructions, but some like shift+add do spend more than 1 cycle in an execution unit.
>
>>When implementing an integer store on an OoO core, one has to find a way around the limit
>>of 2 GPR inputs per uOp, and Intel's 2 uOps is just one such way: not the most
>>economical, but certainly the most flexible as far as further scheduling is concerned.
>
>The A9 has no such limit.
>
>Wilco
Did you actually micro-benchmark it?
It should be quite easy to time a long chain of [reg+reg] stores interleaved with simple ALU ops and see whether Cortex-A9 sustains 2 instructions = 5 register inputs per clock (3 for the store: base, index and data; 2 for the ALU op).
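
Something along these lines is what I have in mind. It is only a rough sketch, not a validated benchmark: the names (run_pairs, ns_per_pair, PAIRS_PER_ITER), the unroll factor of 8, and the buffer layout are all my own assumptions, and the inline asm assumes GCC or Clang targeting 32-bit ARM. On any other target it falls back to plain C, which only checks the logic and says nothing about the A9.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical unroll factor: store+ALU pairs per loop iteration. */
enum { PAIRS_PER_ITER = 8 };

static uint32_t buf[64];

#if defined(__arm__)
/* One pair: a [reg, +reg] store (3 register inputs: data, base, index)
 * followed by a simple ALU op (2 register inputs). */
#define PAIR "str %[a], [%[base], %[idx]]\n\t" \
             "add %[b], %[b], %[a]\n\t"
#endif

/* Executes iters * PAIRS_PER_ITER store+ALU pairs.
 * Returns the accumulator so the work cannot be discarded. */
uint32_t run_pairs(uint32_t iters)
{
    uint32_t *base = buf;
    uint32_t idx = 64;      /* byte offset, i.e. buf[16] */
    uint32_t a = 1, b = 2;

    for (uint32_t i = 0; i < iters; i++) {
#if defined(__arm__)
        __asm__ volatile(
            PAIR PAIR PAIR PAIR PAIR PAIR PAIR PAIR
            : [b] "+r"(b)
            : [a] "r"(a), [base] "r"(base), [idx] "r"(idx)
            : "memory");
#else
        /* Portable fallback: same stores and adds, but the compiler
         * chooses the instructions, so timings are not comparable. */
        for (int k = 0; k < PAIRS_PER_ITER; k++) {
            *(volatile uint32_t *)((char *)base + idx) = a;
            b += a;
        }
#endif
    }
    return b;
}

/* Average CPU time per store+ALU pair, in nanoseconds. */
double ns_per_pair(uint32_t iters)
{
    assert(iters > 0);
    clock_t t0 = clock();
    volatile uint32_t sink = run_pairs(iters);
    clock_t t1 = clock();
    (void)sink;
    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return secs * 1e9 / ((double)iters * PAIRS_PER_ITER);
}
```

Call ns_per_pair() with a large iteration count from main and multiply the result by the core clock in GHz to get cycles per pair. A figure close to 1.0 would suggest the core sustains one store plus one ALU op, i.e. 5 register inputs, per clock; a figure close to 2.0 would suggest it does not. The loop counter and branch add some overhead on top of the 8 pairs, and the adds form a 1-cycle dependency chain, so the clock frequency should be pinned and the unroll factor increased before reading much into the number.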