Article: AMD's Mobile Strategy
By: Michael S (already5chosen.delete@this.yahoo.com), December 21, 2011 2:22 am
Room: Moderated Discussions
Exophase (exophase@gmail.com) on 12/20/11 wrote:
---------------------------
>Linus Torvalds (torvalds@linux-foundation.org) on 12/20/11 wrote:
>---------------------------
>>ARM doesn't have anywhere near the kinds of address
>>generation that x86 has.
>>
>>The whole "base+small offset/shifted index" is just a tiny
>>part of the equation. Static addresses are common, and are
>>part of that "x86 has much more flexible immediates" that
>>you dismissed so cavalierly.
>>
>>Big PC-relative offsets and larger immediates are a big
>>part of address generation. And things like thread-local
>>storage is actually important too these days.
>>
>>You do realize that even the oft-maligned x86 segmenting
>>is actually used again? Using a segment for thread-
>>local storage is actually nice. Having access to it from
>>a CISC instruction set in a single instruction also has
>>real threading advantages, because you have the hardware
>>giving atomicity guarantees wrt NMI's and other events,
>>even in user space.
>>
>>So there really are advantages to the x86 instruction set.
>>In the kernel, we do a lot of per-cpu things, because
>>we care about scalability more than the average bear, and
>>it's a real advantage how we can do a per-cpu increment
>>with a single instruction, exactly because that way we do
>>not need to disable interrupts or preemption.
>>
>>And part of that is the read-modify-write ops, but part of
>>it is also the addressing modes: using a segment prefix to
>>cause the operation to go to the percpu area.
>>
>>So some memory op issues cause x86 instructions to be
>>much more powerful, because it has secondary
>>effects.
>>
>>Linus
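To put a concrete face on the per-cpu/TLS point before going on: on x86-64 Linux, user-space thread-local data is reached through %fs and the kernel's per-cpu data through %gs, so a read-modify-write on such a variable really is one instruction. A minimal user-space sketch (GCC/Clang on x86-64 Linux assumed, function and variable names just for illustration, assembly paraphrased from memory):

#include <stdio.h>

static __thread int counter;      /* addressed via the %fs segment on x86-64 Linux */

int main(void)
{
    counter++;                    /* typically one segment-prefixed RMW, roughly:
                                     addl $1, %fs:counter@tpoff                    */
    printf("%d\n", counter);
    return 0;
}

Because the whole RMW is a single instruction it cannot be torn by a signal (or, in the kernel's %gs case, by an interrupt), which is the atomicity advantage Linus describes for per-cpu counters.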
>
>I don't know why you think I'm disregarding x86's strengths when I list them out
>pretty categorically. No one is saying that absolute addressing or even segmentation
>is useless. Yes, I didn't list absolute addresses specifically, I figured that was
>redundant with reg + reg + imm even if the latter doesn't necessarily have to allow for the former.
>
>I appreciate that the technological relevance something has with respect to the
>kernel is important to you, and I'm sure I could find more examples to support your
>claims. I could also list several examples where folded shifts, predication, and
>post-adjust addressing have made a big difference in inner loops of my ARM code.
>I won't bother making an argument about these things because it's truly anecdotal.
>
>I'm looking for more current references for x86 instruction category histograms. Here's one that's fairly recent:
>
>http://www.strchr.com/x86_machine_code_statistics
>
>This lists absolute addressing as being only 1% of operand types. Mind you, this
>is by program scan and not by dynamic usage, but I would hardly expect that they'd
>be more frequently appearing in execution since you generally want to keep global
>variables out of your tight inner loops.
>
>I could find some other histograms but they're really old, and let's face it -
>code style has changed. Old 16-bit statistics are especially too far removed. If
>you have others please feel free to provide them.
>
>What I CAN provide is some ARM statistics. I have a Nintendo DS emulator capable
>of profiling ARM instruction frequency (dynamically, by execution). I've played
>four games for a few minutes (I only looked at ARM code, not Thumb: most games are
>more ARM heavy). Two were 2D and two were 3D. This is what I found:
>
>- 45.3% of instructions are ALU operations
>- 51% of ALU instructions are reg/reg, 49% are reg/imm
>- 40.66% of reg/reg ALU instructions have an inlined shift
>
>So about 9.3% of instructions use inlined shifts. This number is however inflated
>since it counts mov as an ALU instruction, so all shifts are included. But in my
>experience shifts can be folded more often than not.
I'd guess that at least half of those are ADDs with a left shift by [1..3]. I.e., no better than x86 LEA.
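A minimal sketch of that comparison (function name just for illustration, compiler output paraphrased from memory, not exact):

int *nth(int *a, long i)
{
    return a + i;    /* ARM:    add r0, r0, r1, lsl #2
                        x86-64: lea rax, [rdi + rsi*4]   */
}

For the common "pointer plus scaled index" case, the folded shift buys ARM exactly what the scaled index in LEA already buys x86.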
>
>- 24.2% of instructions are memory operations
>- 70.8% of memory operations are loads, 29.2% are stores
>- 6.7% of memory operations are post-adjust (but this value varies a lot more than the others)
>
>So about 1.6% of instructions use post-adjust addressing.
>
People use pre- and post-adjust addressing when it is available. When it is not, they find other ways to achieve the same objectives, typically with very small overhead. That's where x86 3-component addressing shines.
Also take into account that higher-end OoO ARM cores would have to either crack GPR loads with the update option (as high-end Power/PPC do) or issue them simultaneously through a couple of execution ports. So you wouldn't see much of an energy saving.
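A small sketch of those "other ways" (scalar code, compiler output paraphrased, vectorization ignored):

int sum(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];   /* ARM walks the pointer:     ldr r3, [r0], #4
                        x86 walks via the address:  add eax, [rdi + rcx*4]  */
    return s;
}

The index increment x86 needs is usually just the loop counter it had to maintain anyway, so the missing post-adjust mode costs next to nothing.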
>And about 3.72% of instructions are block memory (no details on average register
>count, sorry). And no, ARM's block memory instructions and x86's rep movs and friends
>are NOT the same. For copying large blocks of memory they're close to the same,
>but ARM's block memory instructions are more commonly used for saving and restoring
>registers. Which happens all the time at function prologues and epilogues, and before
>function calls for that matter. x86 has pusha but that does all registers (but sp)
>and implementations have greatly neglected it to the point where it's never used
>anymore. ARM block memory instructions, on the other hand, have been maintained as decent performers.
>
On the other hand, when a memcpy is long, the semantics of x86 rep movs provide more opportunities for hardware acceleration. The evidence is the fantastic speed (16B/clock, equal to the peak capability of the D$) these instructions demonstrate on Nehalem/Sandy Bridge. IIRC, Cortex-A9 achieves 2 or 4B/clock despite its L1D hardware being capable of 8B/clock.
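To make that concrete, here is a hedged sketch with GCC-style inline asm (x86-64 and GCC/Clang assumed, function name just for illustration; real code should simply call memcpy, which glibc may itself lower to rep movs or SSE/AVX copies):

#include <stddef.h>

static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
    /* The whole copy is one architectural instruction, which is what lets the
       hardware pick its own internal width, e.g. ~16B/clock on Nehalem/SNB.  */
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (n)
                      :
                      : "memory");
}

An LDM/STM copy loop, by contrast, fixes the access pattern in the instruction stream, so the core has less freedom to accelerate it, which fits the Cortex-A9 numbers above.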
>I'd like to give data on predication but my profiling isn't very good with that:
>it tracks predication type, taken branches, and untaken instructions in general,
>but it doesn't distinguish types for untakens. I'll need to adjust it to get more information.
---------------------------