By: Patrick Chase (patrickjchase.delete@this.gmail.com), February 3, 2013 4:29 pm
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on February 2, 2013 11:10 am wrote:
> Patrick Chase (patrickjchase.delete@this.gmail.com) on February 1, 2013 10:11 pm wrote:
> > David suggested posting this to the forum. I think he has a few remarks of his own to add on this topic...
> >
> > I think that the statement that x86 takes 5-15% more area than RISC is a bit simplistic,
> > because the penalty is highly variable depending on what performance level you're
> > targeting and what sort of microarchitecture you have to use to get there.
>
> x86 also has a steeper learning curve as one needs to learn the tricks to handle various odds
> and ends. Intel and AMD already have institutional knowledge about implementation (including
> validation tools), but a third party is less likely to find implementing a variant or an original
> design worthwhile (even if Intel provided the appropriate licensing). It has also been argued
> that a "necessity is the mother of invention" factor drove x86 implementers to innovate.
The P6 and its successors are not terribly innovative. P6 itself is basically a straightforward Tomasulo machine, typical of its generation. The x86-ness was mostly contained in the decoders and in "frontend tricks" like eliding FXCHs via RAT/ROB machinations (arguably a predecessor of uop fusion). If you look at a modern x86 like Haswell it's also a fairly typical OoO microprocessor in the PRF style (schedulers instead of reservation stations, dataless ROB). The principal factors that distinguish it from non-x86 designs are the decoders (of course), the high load/store bandwidth to L1 (doubtless to deal with all of the stack accesses), and frontend tricks like micro- and macro-op fusion. You could argue that the L0 uop cache is also an x86 optimization in that it mitigates the x86 decode penalty, but OoO RISCs have also used predecoded I-caches (for example PA-8000).
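To make the FXCH point concrete, here is a toy Python sketch (my own illustration, not any shipping design) of how a rename table can "execute" a register exchange purely in the map, with no data movement and no execution-port cost:

```python
# Toy sketch of FXCH elision via the register alias table (RAT):
# instead of physically swapping two x87 stack registers, the frontend
# just swaps their entries in the architectural-to-physical mapping.

class RenameTable:
    def __init__(self, num_arch_regs):
        # Start with an identity map: architectural reg i -> physical reg i.
        self.map = list(range(num_arch_regs))

    def fxch(self, a, b):
        """'Execute' FXCH st(a), st(b) at zero execution cost:
        only the mapping changes; no data moves, no uop issues."""
        self.map[a], self.map[b] = self.map[b], self.map[a]

    def physical(self, arch_reg):
        return self.map[arch_reg]

rat = RenameTable(8)          # eight x87 stack slots
rat.fxch(0, 3)                # FXCH st(0), st(3)
assert rat.physical(0) == 3   # later readers of st(0) see st(3)'s value
```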
What I think really happened is not so much that Intel and AMD got more innovative as that the field of play (specifically transistor/power budgets and performance requirements) moved to a space where they could be competitive *using the same architectural techniques as everybody else*. In other words, if everybody is doing OoO machines then the penalty for x86 is inherently much lower than if everybody [else] is doing classic 5-stage RISC pipelines. If true, then that has huge ramifications for the present discussion, given the direction the ARM camp are headed right now in their effort to steal server share from Intel.
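A back-of-envelope illustration of that point, with made-up area numbers chosen only to show the shape of the argument: if the x86 decode overhead is roughly a fixed cost, it dominates a tiny in-order core but nearly vanishes into a wide OoO core.

```python
# Hypothetical numbers, for illustration only: a roughly fixed decode
# overhead is a large fraction of a tiny in-order core but a small
# fraction of a wide OoO core, so the relative x86 penalty shrinks.

X86_DECODE_OVERHEAD_MM2 = 1.5   # hypothetical fixed cost of x86 decode

for name, core_area_mm2 in [("classic 5-stage RISC-class core", 2.0),
                            ("3+-wide OoO core", 20.0)]:
    penalty = X86_DECODE_OVERHEAD_MM2 / core_area_mm2
    print(f"{name}: x86 penalty ~= {penalty:.0%} of core area")
# -> ~75% of the tiny core, but only ~8% of the big OoO core.
```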
> A clean RISC like Alpha (or--from what I have read--AArch64) would be much more friendly to fast bring-up
> of a decent microarchitecture. (Classic ARM seems to be somewhere in the middle--not as complex as x86 but
> not as simple as Alpha--, but even with Thumb2+classic ARM it might be closer to Alpha than to x86.)
AArch64 is indeed a nice, classic RISC architecture. In particular they fixed the biggest single limitation of classic ARM (spending instruction encoding bits on condition-based predication at the expense of GPRs). I'd put it somewhere between MIPS and Alpha on the "architectural purity scale", and that's a pretty good place to be.
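For the curious, here is the encoding arithmetic behind that tradeoff as a quick sketch (the AArch32 field widths and register counts are real; everything else is simplified):

```python
# Back-of-envelope bit budget for a 32-bit instruction word, showing why
# dropping per-instruction predication frees room for more GPRs.

# Classic ARM (AArch32): a 4-bit condition field on nearly every
# instruction, and 4-bit register specifiers.
aarch32_cond_bits = 4
aarch32_reg_field = 4                     # 4 bits -> at most 16 GPRs
aarch32_regs = 2 ** aarch32_reg_field

# AArch64: no condition field on most instructions, 5-bit register fields.
aarch64_reg_field = 5                     # 5 bits -> 32 specifiers
aarch64_regs = 2 ** aarch64_reg_field - 1 # 31 GPRs plus a zero register

print(f"AArch32: {aarch32_cond_bits} bits of every word on predication, "
      f"{aarch32_regs} GPRs")
print(f"AArch64: 0 bits on predication, {aarch64_regs} GPRs")
```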
That said, we've seen several nice, classic RISC architectures go up against x86 in an application space that required wide OoO designs, and they ended up not having much of an advantage when all was said and done. That was really the whole point of my post: you can't measure the x86 penalty in the "classic 5-stage RISC pipe" regime, or even the "static superscalar" regime, and assume it will hold when everybody is forced by performance requirements to build 3+-wide OoO microarchitectures.
> [snip]
> > My own take is that for ARM-based microservers to survive they need to stay down in the "many weak cores"
> > regime and focus on massively parallel workloads that can tolerate the latency penalty. If they try to
> > move up into higher performance brackets then they'll be playing directly into Intel's hand.
>
> I agree that trying to compete with Intel x86 at the high performance end will be excessively difficult,
> but I think the ARM brigade may have a flexibility advantage. Even though Intel has been demonstrating some
> willingness to try new things and develop concurrent multiple microarchitectures, Intel seems to be too conservative
> to try radical designs. It is not clear that ARM will take advantage of its greater tolerance of diversity
> (while learning to provide a coherent interface to software) to introduce some weird and wonderful architectural
> features. ARM has been very quiet about transactional memory and multithreading; features along the lines
> of Intel's TSX and MIPS' MT-ASE could be significant in the server market.
That is indeed ARM's greatest historical strength, and IMO the reason why they succeeded where some RISC players with significantly cleaner ISAs didn't. As you may have gathered I'm not a fan of the pre-AArch64 ARM ISA, and yet I have a LOT of experience developing for it. The reason is that the overall ARM solution (including ecosystem, IP cost, and ease of integration) was compelling for the applications in question, despite any architectural foibles.
I think that business model will be under increasing pressure going forward, though. Fixed development costs (design time, masks, bring-up, and firmware) for SoCs are increasing by a factor of about sqrt(2) per generation, and that's gradually squeezing smaller players out and forcing them to buy standardized parts, whether from TI, NVidia, Qualcomm, or Intel. At that point the fact that your *supplier* had a lot of flexibility and an easy time integrating the core is of rather secondary importance. From the end developer's perspective ARM still benefits from a strong ecosystem, but so does x86.
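For concreteness, here is what that escalation looks like compounded over a few generations (the starting figure is a placeholder, not real data):

```python
# Worked example of the sqrt(2)-per-generation cost escalation claimed
# above. The starting cost is hypothetical; only the growth rate matters.

from math import sqrt

cost = 10.0   # hypothetical fixed SoC development cost today, in $M
for gen in range(1, 7):
    cost *= sqrt(2)
    print(f"generation +{gen}: ~${cost:.1f}M")
# Costs double every two generations, which is what squeezes out the
# smaller players mentioned above.
```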
> Even if ARM does not innovate much architecturally, I think the implementers may feel
> much more free to try different accelerators and microarchitectural tweaks. With
> an Architecture license, non-ARM implementers could even add new instructions.
People with ARM architecture licenses have added custom instructions to ARM, generally for application-specific DSP-like functionality. I don't know how much of what I know is public, so I'll leave it at that. I would observe that ISA changes at the "OS/compiler visible" level (beyond adding an intrinsic here or there) are less common, probably because those would tend to erode the ecosystem benefits. I would also observe that as microarchitectures become more complex people tend to get priced out of that game as well. I expect it's vastly easier to create and verify, say, an ARM9 or R4 derivative than to customize an A15.
Please don't get me wrong: I'm not a proponent of x86 as an ISA. Looking at the assembly from even the best compilers (icc) is a discouraging experience. With that said, I personally don't buy the hype around ARM microservers. We've seen this play out before, and I don't think this time will be different.
The point you made in a separate post about big.LITTLE is probably the most compelling argument I've heard to date, but I think the evidence is very thin. In particular, a lot will depend on whether "LITTLE" ends up being small enough to put x86 into the noncompetitive regime. We shall see :-).