Article: AMD's Mobile Strategy
By: anon (anon.delete@this.anon.com), December 15, 2011 8:58 pm
Room: Moderated Discussions
David Kanter (dkanter@realworldtech.com) on 12/15/11 wrote:
---------------------------
>anon (anon@anon.com) on 12/15/11 wrote:
>---------------------------
>>Linus Torvalds (torvalds@linux-foundation.org) on 12/15/11 wrote:
>>---------------------------
>>>Exophase (exophase@gmail.com) on 12/15/11 wrote:
>>>>Your response makes it sound as if we know nothing
>>>>about ARM64 when we know pretty much everything about it.
>>>
>>>Umm. Are you reading the same thread I am? Are you able to
>>>read at all?
>>>
>>>The whole (and only) point of the thread is power
>>>efficiency. Read the damn subject line again.
>>>
>>>We know absolutely nothing about what the upcoming ARM64
>>>systems will be like from a power-efficiency standpoint.
>>>Especially not if they are four-way decode and aim to be
>>>performance-competitive with x86.
>>>
>>>And no, I do not believe that 4-way ARM64 is going to be
>>>equivalent to 4-way x86. That said, I also don't think that
>>>the decoder is really the dominant factor in that kind of
>>>environment anyway, so I think it's pretty much all
>>>theoretical.
>>
>>The decoder seems to figure quite prominently even in high power applications.
>>Intel has been trying to remove decoder from critical path since P4, and it seems
>>like in Haswell there will be another push towards this.
>
>Both x86 and ARM are variable length. ARM's advantage for decoding is that it is
>really only 2B or 4B instructions, which makes life a hell of a lot easier. x86
>has length-changing prefixes, byte instructions, etc.
>
>But ARMv8 has to support Thumb and other parts of the ISA, so while the decode
>is simpler...it's not *simple* by any means.
I would just like to see any papers or more formal analysis, that's all. We can handwave all day about it :)
>
>It's also worth pointing out that x86 decode is only a real pain when you talk
>about wide decode. A scalar x86 isn't all that hard, since you always know where to look for your instruction : )
Yes, I'm sure that is a huge complicating factor.
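To make the "where to look" point concrete, here is a toy sketch. It is not real x86 or ARM encoding: the length rule in toy_length is made up purely for illustration. With a fixed size, every start offset in the fetch window is known up front; with variable lengths, each start depends on all the lengths before it, which is the serial chain a wide decoder has to break somehow.

```python
def fixed_starts(width, size=4):
    """Start offsets for `width` fixed-size instructions.
    No instruction bytes need to be inspected."""
    return [i * size for i in range(width)]


def variable_starts(window, width, length_of):
    """Start offsets when each length must be decoded first.
    Each start depends on every earlier length (a serial dependency)."""
    starts, pos = [], 0
    while len(starts) < width and pos < len(window):
        starts.append(pos)
        pos += length_of(window, pos)
    return starts


def toy_length(window, pos):
    # Hypothetical encoding: low 4 bits of the first byte give the length.
    return max(1, window[pos] & 0x0F)


window = bytes([0x03, 0, 0, 0x01, 0x05, 0, 0, 0, 0, 0x02, 0, 0x01])
print(fixed_starts(4))
print(variable_starts(window, 4, toy_length))
```

A scalar decoder only ever needs the first start, which is why the problem mostly shows up at width.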
>>I hear lots of claims one way or the other about whether x86 decoders cost a lot.
>>Is there any proof? Because on the evidence of what Intel is doing, they seem to be costly.
>
>I think it's pretty fair to say that 4-wide x86 decode is likely to be expensive.
>I really have no idea how much power goes to decoding in a modern design, especially
>with the uop cache. But I suspect most of the complexity comes from a few nasty
>parts of x86. Doing load+op probably isn't a big deal if your ISA is fairly regular (e.g. zArch).
Well, we really want to talk about both power and performance, because we're ultimately interested in perf/watt. So if a low-power decoder can be done with an extra cycle, the cost is simply moved from the power side of the equation to the performance side.
The uop cache is of course evidence that decoders are still a big issue for Intel, although on its own it is not evidence that the x86 decoder is worse than that of another ISA. I do suspect that at 4-wide, it would be far more difficult than a fixed-length decoder.
Now if Haswell does something radical to move the decoders completely out of the critical path (e.g. a big L2 uop cache, and shared decoders that only need to match L3 refill bandwidth at a relaxed latency), then the decoder itself is no longer a problem, but the "economic" cost of decoding x86 instructions is still high if it has necessitated a design that other ISAs can do without.
So the uop cache reduces the cost of the actual decoder unit, but the overall cost of decoding x86 instructions is still high, even on a high-power uarch, despite all the assertions to the contrary.
Seeing that "conventional wisdom" is incorrect on this matter, I'm just interested in more evidence. I cannot contribute anything more than handwaving myself.
>
>More to the point, Linus is totally right that you cannot compare 4-wide x86 decode
>to a 4-wide ARM decode. The semantic content of the instructions is totally different.
Yes, but I think that excursion derailed the main point of the discussion.
x86 versus ARM / ARM64 instruction expressiveness would be another interesting topic. I'm sure there are studies around, though I have not seen any recent ones.
>
>There's no way to know how the two really compare without measuring dynamic instruction
>count on relevant workloads, but here are some thinking points:
>
>1. x86 is load+op, which is the equivalent of 2 ARM instructions. So the Nehalem
>decoder can output the equivalent of 8 ARMv8-A instructions.
>
>2. ARMv8-A has twice as many registers, hence fewer register spills.
>
>3. x86 has more powerful addressing, reducing ALU ops.
>
>4. x86 has microcoded instructions like REP MOV, which can translate into many
>ARMv8-A instructions. OTOH, this is a serious pain in the ass for decode, and for handling exceptions, atomics, etc.
>
>5. ARMv8-A has a weaker ordering model, meaning that extra instructions are necessary.
>Often, x86 synchronization ends up as a NOP. Of course, this forces x86 to use large
>ordering buffers, whereas ARM does not.
>
>6. ARMv8-A has more robust conditionals.
>
>7. ARMv8-A does not have the blight known as x87.
>
>8. I can't recall how much of x86 and ARMv8-A are 2 vs. 3 operand and how much
>are destructive versus non-destructive, but this matters too.
>
>I'd also like to emphasize that many of the things which make x86 instructions
>more compact make decoding a real pain or have other implications on the pipeline.
>
>On the whole, I strongly suspect x86 has a lower dynamic instruction count...but
>I'm not sure by how much. Is it 10%, 20%, 30%? I have no basis for knowing (especially
>since there aren't many ARMv8-A workloads to run).
>
>But that still is only part of the picture. While a 4-wide x86 decoder might be
>equivalent to 5-wide ARMv8-A decode, that still doesn't tell you how much harder
>x86 decode is...and what other costs it imposes throughout the pipeline.
No. A 4-wide decode of a "nice" ISA is going to be far easier than a 4-wide decode of x86, regardless of the exact expressiveness of the instructions being decoded. I think that was the point of Wilco's remark about decoding, rather than a claim that an x86 instruction is exactly equivalent to an ARM instruction, which is how people seem to have taken it.
(He did go on to talk about the efficiency of ARM etc., but he did not seem to be drawing that conclusion from 4-wide decode being exactly equivalent, just from x86 having high decode overhead, which it does.)
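For what it's worth, the "4-wide x86 might be equivalent to 5-wide ARMv8-A" figure quoted above is just the 20% guess turned into a ratio. A small sketch of that arithmetic, assuming the only thing that matters is dynamic instruction count (which is exactly the simplification being argued about); the function name and the model are mine, not a measurement:

```python
def equivalent_width(x86_width, x86_count_reduction):
    """Decode width the other ISA would need to match x86_width,
    assuming x86 runs (1 - x86_count_reduction) times as many
    instructions for the same work. Purely an assumed model."""
    if not 0.0 <= x86_count_reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return x86_width / (1.0 - x86_count_reduction)


# The 10%, 20%, 30% guesses from the quoted post.
for r in (0.10, 0.20, 0.30):
    width = equivalent_width(4, r)
    print(f"{r:.0%} fewer x86 instructions: 4-wide x86 ~ {width:.2f}-wide")
```

That gives roughly 4.4, 5.0 and 5.7 for the three guesses. It says nothing about how hard each decoder is to build, which is the part I care about.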