Article: AMD's Mobile Strategy
By: Exophase (exophase.delete@this.gmail.com), December 20, 2011 7:19 pm
Room: Moderated Discussions
Linus Torvalds (torvalds@linux-foundation.org) on 12/20/11 wrote:
---------------------------
>ARM doesn't have anywhere near the kinds of address
>generation that x86 has.
>
>The whole "base+small offset/shifted index" is just a tiny
>part of the equation. Static addresses are common, and are
>part of that "x86 has much more flexible immediates" that
>you dismissed so cavalierly.
>
>Big PC-relative offsets and larger immediates are a big
>part of address generation. And things like thread-local
>storage is actually important too these days.
>
>You do realize that even the oft-maligned x86 segmenting
>is actually used again? Using a segment for thread-
>local storage is actually nice. Having access to it from
>a CISC instruction set in a single instruction also has
>real threading advantages, because you have the hardware
>giving atomicity guarantees wrt NMI's and other events,
>even in user space.
>
>So there really are advantages to the x86 instruction set.
>In the kernel, we do a lot of per-cpu things, because
>we care about scalability more than the average bear, and
>it's a real advantage how we can do a per-cpu increment
>with a single instruction, exactly because that way we do
>not need to disable interrupts or preemption.
>
>And part of that is the read-modify-write ops, but part of
>it is also the addressing modes: using a segment prefix to
>cause the operation to go to the percpu area.
>
>So some memory op issues cause x86 instructions to be
>much more powerful, because it has secondary
>effects.
>
>Linus
I don't know why you think I'm disregarding x86's strengths when I list them out pretty categorically. No one is saying that absolute addressing or even segmentation is useless. Yes, I didn't list absolute addresses specifically; I figured that was redundant with reg + reg + imm, even if the latter doesn't necessarily have to allow for the former.
I appreciate that a feature's relevance to the kernel is important to you, and I'm sure I could find more examples to support your claims. I could also list several examples where folded shifts, predication, and post-adjust addressing have made a big difference in inner loops of my ARM code. I won't bother making an argument out of these because they're truly anecdotal.
I'm looking for more current references for x86 instruction category histograms. Here's one that's fairly recent:
http://www.strchr.com/x86_machine_code_statistics
This lists absolute addressing as only 1% of operand types. Mind you, this is by program scan and not by dynamic usage, but I would hardly expect absolute addresses to appear more frequently in execution, since you generally want to keep global variables out of your tight inner loops.
I could find some other histograms but they're really old, and let's face it - code style has changed. Old 16-bit statistics are especially too far removed. If you have others please feel free to provide them.
What I CAN provide is some ARM statistics. I have a Nintendo DS emulator capable of profiling ARM instruction frequency (dynamically, by execution). I've played four games for a few minutes (I only looked at ARM code, not Thumb: most games are more ARM heavy). Two were 2D and two were 3D. This is what I found:
- 45.3% of instructions are ALU operations
- 51% of ALU instructions are reg/reg, 49% are reg/imm
- 40.66% of reg/reg ALU instructions have an inlined shift
So about 9.3% of instructions use inlined shifts. This number is inflated, however, since it counts mov as an ALU instruction, and ARM encodes a standalone shift as a mov with a shifted operand, so all standalone shifts are included too. But in my experience shifts can be folded more often than not.
- 24.2% of instructions are memory operations
- 70.8% of memory operations are loads, 29.2% are stores
- 6.7% of memory operations are post-adjust (but this value varies a lot more than the others)
So about 1.6% of instructions use post-adjust addressing.
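For anyone who wants to check the arithmetic, here is a minimal sketch that chains the percentages listed above into shares of all executed instructions. The variable names are made up for illustration, and the inputs are the rounded figures from this post rather than raw profiler counts, so the inlined-shift result comes out nearer 9.4% than the quoted 9.3%.

```python
# Rounded shares quoted in this post (fractions of the named parent group).
alu_share = 0.453            # ALU ops as a share of all instructions
reg_reg_share = 0.51         # reg/reg as a share of ALU ops
shifted_share = 0.4066       # inlined shift as a share of reg/reg ALU ops
mem_share = 0.242            # memory ops as a share of all instructions
post_adjust_share = 0.067    # post-adjust as a share of memory ops

# Multiply down each chain to get a share of all instructions.
inlined_shift_total = alu_share * reg_reg_share * shifted_share
post_adjust_total = mem_share * post_adjust_share

print(f"inlined shifts: {inlined_shift_total:.2%} of all instructions")
print(f"post-adjust:    {post_adjust_total:.2%} of all instructions")
```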
And about 3.72% of instructions are block memory (no details on average register count, sorry). And no, ARM's block memory instructions and x86's rep movs and friends are NOT the same. For copying large blocks of memory they're close to the same, but ARM's block memory instructions are more commonly used for saving and restoring registers, which happens all the time at function prologues and epilogues, and before function calls for that matter. x86 has pusha, but that does all registers (but sp), and implementations have neglected it so badly that it's never used anymore. ARM block memory instructions, on the other hand, have been maintained as decent performers.
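The missing register count could in principle be gathered from the instruction words themselves. The following is only a rough sketch, not the emulator's actual profiling code: the function name and the sample words are hypothetical. It relies on the ARM encoding of LDM/STM, where bits 27:25 are 100 and bits 15:0 hold the register list, and it skips the 0b1111 condition field, which is not an ordinary conditional encoding.

```python
def block_transfer_register_count(word):
    """Return how many registers an ARM LDM/STM word names, or None if
    the word is not a block transfer (hypothetical helper)."""
    cond = (word >> 28) & 0xF
    is_block = ((word >> 25) & 0b111) == 0b100
    if cond == 0xF or not is_block:
        return None
    # Each set bit in the low 16 bits selects one register (r0..r15).
    return bin(word & 0xFFFF).count("1")

# Illustrative words: stmdb sp!, {r4, lr}; ldmia sp!, {r4, pc};
# add r0, r1, r2, lsl #2 (not a block transfer).
samples = [0xE92D4010, 0xE8BD8010, 0xE0810102]
counts = [block_transfer_register_count(w) for w in samples]
transfers = [c for c in counts if c is not None]
average = sum(transfers) / len(transfers)
print(counts, average)
```

Averaging this count over executed block transfers, instead of over a static list like the one here, would give the per-instruction register figure that the profile above lacks.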
I'd like to give data on predication but my profiling isn't very good with that: it tracks predication type, taken branches, and untaken instructions in general, but it doesn't distinguish types for untakens. I'll need to adjust it to get more information.