By: David Kanter (dkanter.delete@this.realworldtech.com), August 9, 2014 10:50 am
Room: Moderated Discussions
juanrga (nospam.delete@this.juanrga.com) on August 9, 2014 6:38 am wrote:
> David Kanter (dkanter.delete@this.realworldtech.com) on August 8, 2014 11:36 pm wrote:
> > > > Which is obviously something a marketing/executive person would say, but it's also completely false. The
> > > > *concept* of an x86 tax is absolutely true. And not just the concept, but even in implementation, we can
> > > > take a really simple example which is the instruction decoding complexity, and point to that.[*]
> > > >
> > > > There have also been engineers in the past acknowledge some inefficiencies and estimate
> > > > "x86 tax" for then-PC class designs. Whether those are still valid with x64 and ever more
> > > > complex CPUs is up for debate, but certainly the *concept* of an x86 tax is there.
> > > >
> > > > It's also not really disputed that at the very small scale, x86
> > > > designs can't compete with simple ARM based microarchitectures.
> > >
> > > Take a modern A57 core. According to AMD the A57 Opteron is faster than jaguar based Opteron but
> > > consumes less power. The ARM core performance is ~40% faster, and consumes roughly one half.
> > >
> > > Jaguar is considered a good x86 design and even competitive against
> > > Intel last designs. Thus we are seeing the x86 tax in action.
> >
> > No you aren't. Jaguar is a good core design, but the uncore was inappropriate for
> > servers. You are comparing a design with a server-specific uncore vs. one with a client-optimized
> > uncore. It's no surprise that the Jaguar-based design is behind.
>
> This is a non-issue. The same will happen when you compare desktops, laptops, or tablets using A57
> or Jaguar. The ARM core will be faster and efficient than the x86 core, despite the latter is a very
> good design in x86 space and the former is only the first standard 64bit core. When custom ARM is compared
> to jaguar then things look even poor for x86. Check Anand review of cyclone, for instance.
You are trying to make the following argument:
A) A57 Opteron is faster than Jaguar Opteron --> B) A57 core is faster than Jaguar core --> C) ARM ISA is intrinsically higher performance than x86.
That chain of reasoning is totally broken:
The A57 Opteron has a totally different uncore that is better tailored for servers: a different cache hierarchy, memory controller, etc., which yields much higher performance. So A does not imply B. Also, Jaguar wasn't designed for servers at all.
So you're just plain wrong.
> > ARM is full of legacy crap as well. Not to mention the fact that an ARMv8 requires
> > 3-4 different decoders. I know a few people who have had the pleasure of designing
> > custom ARM cores, and according to them 'ARMv8 decode is just as terrible as x86'.
>
> By ARM64 I am referring to AArch64 exclusively. ARMv8 can be A or T and it includes AArch32 for legacy.
Can you name a design which is ARM64 only with no 32-bit support? I can't. That means that decoders are needed for ARMv8, ARMv7, and Thumb.
> The designs that I am commenting are pure AArch64 implementations,
> legacy 32bit mode is not needed for HPC for instance.
OK, so you only want to talk about HPC then?
> > There really isn't a significant x86 tax. Perhaps 5% for a reasonable
> > core (obviously things are much worse for scalar cores).
>
> I have info from Intel that says otherwise and I know that he is painting in rose.
I don't believe you. I know a number of architects (e.g., Greg Favor, Mike Filippo) who have designed both x86 and ARM cores, and everyone says the same thing. ISA doesn't really matter for performance. x86 takes a bit more effort, but it's more validation than anything else.
> According to him
> legacy support already accounts for one-third of the energy of integer execution. This doesn't include
> fetch-decode energy, which sums up to about 2/3. Moreover his numbers perpetuate the myth that x86 tax
> is only in the decode: part of the energy associated to execution has a penalty due to the ISA.
That's rubbish. Explain to me how executing a register-register ADD instruction is significantly more expensive for x86 than ARM. Seriously - try it. There's no difference for most instructions.
ARM implementations tend to use uops and have plenty of complex instructions to deal with (e.g., LDM, STM).
> > > According to Feldman an entirely custom server chip using the ARM architecture takes about 18 months
> > > and about $30 million. By contrast, it takes three or four-year time frame and $300--400 million in
> > > development costs required to build an x86-based server chip based on a new micro-architecture.
> >
> > Those numbers are suspect and also probably not comparing the right things. Much
> > of the cost of a server design is in the cache, coherent interconnects, memory controller,
> > power management, etc. which is necessary for any design, ARM or x86.
>
> Apparently you have not heard of AMD AMBIDEXTROUS strategy. Only the core
> changes, the rest of the chip is the same up to the pin level.
You tried arguing that ARM servers are cheaper to design than x86 servers based on Andrew Feldman's statements. My point is it's the same cost. The only thing that is different is that the ARM core is a bit simpler. All the 'uncore' is equally expensive for ARM and x86. A good L3 cache doesn't know anything about ARM or x86. It's a cache.
How does AMD's strategy mean anything about the relative cost of ARM and x86 servers? Doing a server using an already designed core (e.g., a licensed core, or one that was already built for desktop/notebook) is cheaper than designing a core from scratch.
> Above numbers are credible. They are the reason why so many companies are doing competitive server/HPC
> designs. They are the reason that K12 core comes first and the zen core comes later.
How do you know they are competitive? Have you seen performance for ARM-based servers in general availability? Are those numbers competitive with Ivy Bridge-EP?
Most ARM-based servers are currently implemented as PPT files and half-complete projects, so the performance is indeterminate. They could hit their targets, but it is most likely they will not. First generation products tend to have teething difficulties.
The exceptions would be Applied Micro (which isn't quite in GA yet) and AMD (also not in GA). So if you want to talk about Vulcan or Thunder, those need to be compared against Haswell-EP or Skylake-EP, because those are the products they will compete against.
David