Article: AMD's Mobile Strategy
By: anon (anon.delete@this.anon.com), December 17, 2011 11:13 am
Room: Moderated Discussions
David Kanter (dkanter@realworldtech.com) on 12/15/11 wrote:
---------------------------
>anon (anon@anon.com) on 12/15/11 wrote:
>---------------------------
>>Linus Torvalds (torvalds@linux-foundation.org) on 12/15/11 wrote:
>>---------------------------
>>>Exophase (exophase@gmail.com) on 12/15/11 wrote:
>>>>Your response makes it sound as if we know nothing
>>>>about ARM64 when we know pretty much everything about it.
>>>
>>>Umm. Are you reading the same thread I am? Are you able to
>>>read at all?
>>>
>>>The whole (and only) point of the thread is power
>>>efficiency. Read the damn subject line again.
>>>
>>>We know absolutely nothing about what the upcoming ARM64
>>>systems will be like from a power-efficiency standpoint.
>>>Especially not if they are four-way decode and aim to be
>>>performance-competitive with x86.
>>>
>>>And no, I do not believe that 4-way ARM64 is going to be
>>>equivalent to 4-way x86. That said, I also don't think that
>>>the decoder is really the dominant factor in that kind of
>>>environment anyway, so I think it's pretty much all
>>>theoretical.
>>
>>The decoder seems to figure quite prominently even in high power applications.
>>Intel has been trying to remove decoder from critical path since P4, and it seems
>>like in Haswell there will be another push towards this.
>
>Both x86 and ARM are variable length. ARM's advantage for decoding is that it is
>really only 2B or 4B instructions, which makes life a hell of a lot easier. x86
>has length-changing prefixes, byte instructions, etc.
>
>But ARMv8 has to support Thumb and other parts of the ISA, so while the decode
>is simpler...it's not *simple* by any means.
>
>It's also worth pointing out that x86 decode is only a real pain when you talk
>about wide decode. A scalar x86 isn't all that hard, since you always know where to look for your instruction : )
>
>>I hear lots of claims one way or the other about whether x86 decoders cost a lot.
>>Is there any proof? Because on the evidence of what Intel is doing, they seem to be costly.
>
>I think it's pretty fair to say that 4-wide x86 decode is likely to be expensive.
>I really have no idea how much power goes to decoding in a modern design, especially
>with the uop cache. But I suspect most of the complexity comes from a few nasty
>parts of x86. Doing load+op probably isn't a big deal if your ISA is fairly regular (e.g. zArch).
>
>More to the point, Linus is totally right that you cannot compare 4-wide x86 decode
>to a 4-wide ARM decode. The semantic content of the instructions is totally different.
>
>There's no way to know how the two really compare without measuring dynamic instruction
>count on relevant workloads, but here are some thinking points:
>
>1. x86 is load+op, which is the equivalent of 2 ARM instructions. So the Nehalem
>decoder can output the equivalent of 8 ARMv8-A instructions.
>
>2. ARMv8-A has twice as many registers, hence fewer register spills.
>
>3. x86 has more powerful addressing, reducing ALU ops.
>
>4. x86 has microcoded instructions like REP MOV, which can translate into many
>ARMv8-A instructions. OTOH, this is a serious pain in the ass for decode, and for handling exceptions, atomics, etc.
>
>5. ARMv8-A has a weaker ordering model, meaning that extra instructions are necessary.
>Often, x86 synchronization ends up as a NOP. Of course, this forces x86 to use large
>ordering buffers, whereas ARM does not.
>
>6. ARMv8-A has more robust conditionals.

How do you figure that? What do you mean by robust conditionals?
I don't think there is evidence for this point.

>7. ARMv8-A does not have the blight known as x87.
>
>8. I can't recall how much of x86 and ARMv8-A are 2 vs. 3 operand and how much
>are destructive versus non-destructive, but this matters too.
>
>I'd also like to emphasize that many of the things which make x86 instructions
>more compact make decoding a real pain or have other implications on the pipeline.
>
>On the whole, I strongly suspect x86 has a lower dynamic instruction count...but
>I'm not sure by how much. Is it 10%, 20%, 30%? I have no basis for knowing (especially
>since there aren't many ARMv8-A workloads to run).
>
>But that still is only part of the picture. While a 4-wide x86 decoder might be
>equivalent to 5-wide ARMv8-A decode, that still doesn't tell you how much harder
>x86 decode is...and what other costs it imposes throughout the pipeline.
>
>David
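
To make the arithmetic behind the "4-wide x86 might be equivalent to 5-wide ARMv8-A" remark concrete: if x86 needs a fraction r fewer dynamic instructions for the same work, a W-wide x86 decoder covers roughly the same work per cycle as a W / (1 - r) wide ARMv8-A decoder. This is only a sketch. The function name is made up for illustration, and the 10%, 20%, and 30% figures are the guesses quoted above, not measurements.

```python
def equivalent_arm_width(x86_width, x86_reduction):
    """Decode width ARMv8-A would need to match an x86 decoder of x86_width.

    x86_reduction is the assumed fraction by which the x86 dynamic
    instruction count is lower than the ARMv8-A count (0.20 means 20% fewer).
    This ignores everything except instruction count (hypothetical model).
    """
    if not 0 <= x86_reduction < 1:
        raise ValueError("x86_reduction must be in [0, 1)")
    return x86_width / (1.0 - x86_reduction)


# The three guesses from the quoted post; none of them is measured data.
for r in (0.10, 0.20, 0.30):
    width = equivalent_arm_width(4, r)
    print(f"{r:.0%} fewer instructions: 4-wide x86 ~ {width:.2f}-wide ARMv8-A")
```

Under this model the 5-wide figure corresponds to the 20% guess. It says nothing about how much harder each decoder is to build, which is the open question in the thread.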