By: juanrga (nospam.delete@this.juanrga.com), August 16, 2014 11:35 am
Room: Moderated Discussions
Maynard Handley (name99.delete@this.name99.org) on August 16, 2014 9:35 am wrote:
> juanrga (nospam.delete@this.juanrga.com) on August 16, 2014 4:29 am wrote:
> > Mark Roulo (nothanks.delete@this.xxx.com) on August 14, 2014 8:30 am wrote:
> > > David Kanter (dkanter.delete@this.realworldtech.com) on August 13, 2014 11:47 am wrote:
> > > > > > According to Feldman an entirely custom server chip using the ARM architecture takes about 18 months
> > > > > > and about $30 million. By contrast, it takes three or four-year time frame and $300--400 million in
> > > > > > development costs required to build an x86-based server chip based on a new micro-architecture.
> > > > >
> > > > > An interesting video confirms what you are saying. Search for: Jim
> > > > > Keller On AMD's Next-Gen High Performance x86 & K12 ARM Cores.
> > > > >
> > > > > Saw this a couple of months ago, but if my memory serves me correctly, he said that with the
> > > > > same transistor budget he is able to build a faster core with Aarch64 than with x86_64.
> > > > > Time will tell if that is true, but Jim seems to know what he is talking about.
> > > >
> > > > Jim is 100% right - it is a bit easier to design an ARMv8 core than
> > > > x86, all things being equal. How much is the difference though?
> > > >
> > > > I wrote about this extensively before:
> > > >
> > > > http://www.realworldtech.com/microservers/4/
> > > >
> > > > My analysis is as follows: assume a 15% gain for an ARM core vs. x86 (I think 5-10% is more realistic, but
> > > > let's be generous), that is only a 5% gain at the chip level. 5% just isn't a significant advantage.
> > >
> > > It won't matter for the folks that think that the ARM ISA provide a huge advantage when designing 10+ Watt
> > > chips, but ... about 10-ish years ago Microprocessor Report had an article that included a discussion with
> > > one of the POWER architects. He mentioned that for the
> > > space that POWER was competing, the ISA didn't matter
> > > enough(*) to be worth getting worked up over. It would be interesting to see if there was any elaboration
> > > (because I'm going off 10+ year old memory here), but that would require access to the article. Googling
> > > hasn't turned up anything (which isn't much of a surprise). This *IS* an argument from authority, but in
> > > this case armchair analysis by folks who have never built a high performance CPU is pretty suspect :-)
> > >
> > >
> > >
> > >
> > > (*) The implication was that the ISA wasn't a complete performance killer. A register-to-register
> > > architecture with only two registers would obviously be a non-trivial disadvantage. So might an ISA
> > > that *required* sequential instruction decoding (the 68K family was supposed to have this problem).
> >
> >
> > I am unsure whether Jim Keller's words are being considered here
> > an "armchair analysis", thus I will refer to hard data.
> >
> > In the first place it is worth mentioning that POWER guys have ignored the "ISA didn't
> > matter enough(*) to be worth getting worked up over" and have been developing the ISA
> > during last ten years. The current version of POWER ISA is the 2.06 (revision B).
> >
> > It is also evident that Intel has gained most of its performance from new ISAs. Under
> > x86 ISA the gains are 5% per year or so. Using new extensions such as TSX, AVX2... you
> > can see 2x performance gains in a single generation change (e.g. IVB --> HW).
> >
> > Intel knows that ISAs matter for performance and has given a recent talk about all
> > the new ISAs that is developing: AVX512F, AVX512{VL, DQ, BW}, CDI, ERI, PFI,...
> >
> > You mention Microprocessor Report. Precisely the guys at Microprocessor Report estimated the
> > Cortex A57 would gain 10% performance when running in ARM AArch64 rather than AArch32 mode
> >
> > Does ISA Matter for Performance?
> >
> > I.e. the same processor core, the same program, just compiled using a different set of
> > instructions. Code rewritten and optimized for the new ARM ISA produces bigger gains.
>
> As a technical matter, the changes to POWER ISA have all, as far as I know, been on
> the OS side; mostly to give the hypervisor better control over paging. I think there's
> been one change to the collection of user-level sync instructions. I agree with your
> stance that ISA matters, but I don't think this aspect of POWER proves the point.
The "Does ISA Matter for Performance?" article linked above mentions one of the changes made to the POWER ISA to improve performance. This is the relevant fragment:
Still ISA matters. And even within a family, ISA changes can have a huge impact on performance.
[...]
Inside a mainstream ISA, the trend of the last decade has been the successive addition of small and large sets of specialized instructions for doing various important computations. [...] Power Architecture has included binary-coded-decimal instructions, and IBM mainframes have instructions that do string copies in the L3 cache. Configurable architectures like Tensilica show the value in adding a few well-chosen application-specific instructions. Adding features to existing ISAs has been proven to have very good bang for the buck.
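The binary-coded-decimal case the fragment mentions is easy to make concrete. Below is a toy Python model (my own illustration, not IBM's implementation) of packed-BCD addition, one decimal digit per 4-bit nibble; the per-digit carry loop is exactly the work that a single dedicated BCD-add instruction collapses into one operation:

```python
def bcd_add(a, b):
    """Packed-BCD addition, one decimal digit per 4-bit nibble: the
    kind of operation a dedicated instruction (as in the Power and
    mainframe ISAs mentioned above) performs in one step, versus
    this per-nibble decimal-carry loop in plain integer code."""
    out, carry, shift = 0, 0, 0
    while a or b or carry:
        d = (a & 0xF) + (b & 0xF) + carry           # add one decimal digit
        carry, d = (1, d - 10) if d > 9 else (0, d)  # decimal carry-out
        out |= d << shift
        a, b, shift = a >> 4, b >> 4, shift + 4
    return out

print(hex(bcd_add(0x1999, 0x0001)))  # 0x2000: decimal 1999 + 1 = 2000
```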
> Also, changing vector length isn't especially interesting in this context. No-one denies that
> just doubling the length of NEON vectors would improve a certain collection of code, and quadrupling
> them would give further improvement. But that's not especially interesting; in the same way
> that doubling the number of CPUs on your chip is not especially interesting.
> More interesting would be some way to quantify the improvement gained by additional "different" types of instructions
> that Intel has added. AVX is such a monster I can't tell what's what there and what was added when; but as
> far as I can tell until AVX-512 they didn't have a decent generic Permute, only a collection of various specialized
> shuffles. One could try, for example, to quantify the value of generic permute as opposed to limiting the compiler
> to, say, the collection of SSE4.1 shuffles and their 256/512 bit counterparts.
> One could do the same sort of reckoning for scatter and gather, again genuinely
> new types of instructions, or the new K-register masked instructions.
The new extensions Intel is developing, CDI, ERI, and PFI, are not about doubling vector size; they aim to raise performance in specific cases. The original AVX-512F doubles the vector width to 512 bits, while the newer AVX-512 VL, DQ, and BW extensions keep that width and introduce other features to improve performance in cases where AVX-512F alone is not enough.
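CDI is a good example of a "different type" of instruction rather than a wider one. A scalar Python model of its vpconflictd operation (my own sketch of the published semantics): for each lane it reports which earlier lanes hold the same value, which is precisely the information a compiler needs to vectorize a histogram-style scatter with possibly duplicate indices:

```python
def vpconflict(idx):
    """Scalar model of AVX-512 CDI's vpconflictd: for each lane i,
    a bitmask marking every earlier lane j < i that holds the same
    value.  A zero mask means the lane's scatter index is unique
    and the vectorized update for that lane is safe."""
    return [sum(1 << j for j in range(i) if idx[j] == idx[i])
            for i in range(len(idx))]

# Duplicate indices (lane 2 repeats lane 0, lane 3 repeats lane 1)
# are exactly what breaks a naive vectorized histogram update; the
# conflict masks tell the loop which lanes need serial retries.
print(vpconflict([4, 9, 4, 9]))  # [0, 0, 1, 2]
```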
> On the POWER side, something closer to what I think we all have in mind is the mini-graph
> processing that was added to POWER8 to provide macro-instruction fusion and thereby provide
> better support for wide immediates (especially immediates used to load from the GOT).
> I mention this because it shows, in a sense, the flip side to the x86 experience. Where x86 can (and
> does) more or less easily add new capabilities through new instructions, RISC CPUs can add (in a backward
> compatible fashion) SOME additional capabilities through macro-instruction fusion/mini-graph processing.
> POWER8 has done this in a very limited and specialized way, but like anything, the first step is the
> hardest. I could imagine them combing through instruction traces looking for common instruction pairs
> and if there are others that are hot enough, giving those also the fusion treatment.
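The fusion idea in the quote above can be sketched in a few lines. This toy Python model (mnemonics and operand forms are illustrative, not the exact POWER8 mechanism) scans adjacent instruction pairs and, when a shifted-immediate add feeds a low-immediate OR into the same register, emits one fused 32-bit constant load:

```python
def fuse_pair(insns):
    """Toy model of macro-instruction fusion: when an 'addis' (add
    a 16-bit immediate shifted left 16) is immediately followed by
    an 'ori' (OR a low 16-bit immediate) targeting the same
    register, replace the pair with a single fused 32-bit constant
    load, here called 'li32'.  Instructions are (mnemonic, dest,
    src, operand) tuples; encodings are illustrative only."""
    out, i = [], 0
    while i < len(insns):
        a = insns[i]
        b = insns[i + 1] if i + 1 < len(insns) else None
        if (b and a[0] == "addis" and b[0] == "ori"
                and a[1] == b[1] == b[2]):            # same target/source reg
            out.append(("li32", a[1], (a[3] << 16) | b[3]))
            i += 2                                     # pair consumed as one op
        else:
            out.append(a)
            i += 1
    return out

prog = [("addis", "r9", "r0", 0x1234),   # r9 = 0x1234 << 16
        ("ori",   "r9", "r9", 0x5678),   # r9 |= 0x5678
        ("add",   "r3", "r3", "r9")]
print(fuse_pair(prog))  # three fetched instructions, two executed ops
```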