By: juanrga (nospam.delete@this.juanrga.com), August 10, 2014 3:32 am
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on August 9, 2014 10:50 am wrote:
> juanrga (nospam.delete@this.juanrga.com) on August 9, 2014 6:38 am wrote:
> > David Kanter (dkanter.delete@this.realworldtech.com) on August 8, 2014 11:36 pm wrote:
> > > > > Which is obviously something a marketing/executive person would say, but it's also completely false. The
> > > > > *concept* of an x86 tax is absolutely true. And not just the concept, but even in implementation, we can
> > > > > take a really simple example which is the instruction decoding complexity, and point to that.[*]
> > > > >
> > > > > There have also been engineers in the past acknowledge some inefficiencies and estimate
> > > > > "x86 tax" for then-PC class designs. Whether those are still valid with x64 and ever more
> > > > > complex CPUs is up for debate, but certainly the *concept* of an x86 tax is there.
> > > > >
> > > > > It's also not really disputed that at the very small scale, x86
> > > > > designs can't compete with simple ARM based microarchitectures.
> > > >
> > > > Take a modern A57 core. According to AMD the A57 Opteron is faster than jaguar based Opteron but
> > > > consumes less power. The ARM core performance is ~40% faster, and consumes roughly one half.
> > > >
> > > > Jaguar is considered a good x86 design and even competitive against
> > > > Intel last designs. Thus we are seeing the x86 tax in action.
> > >
> > > No you aren't. Jaguar is a good core design, but the uncore was inappropriate for
> > > servers. You are comparing a design with a server-specific uncore vs. one with a client-optimized
> > > uncore. It's no surprise that the Jaguar-based design is behind.
> >
> > This is a non-issue. The same will happen when you compare desktops, laptops, or tablets using A57
> > or Jaguar. The ARM core will be faster and efficient than the x86 core, despite the latter is a very
> > good design in x86 space and the former is only the first standard 64bit core. When custom ARM is compared
> > to jaguar then things look even poor for x86. Check Anand review of cyclone, for instance.
>
> You are trying to make the following argument:
>
> A) A57 Opteron is faster than Jaguar Opteron --> B) A57 core is faster than
> Jaguar core --> C) ARM ISA is intrinsically higher performance than x86.
>
> That chain of reasoning is totally broken:
>
> A57 Opteron has a totally different uncore that is better tailored for servers. So A does
> not imply B. The A57 Opteron has a totally different cache hierarchy, memory controller, etc.
> that yields much higher performance. Also, Jaguar wasn't designed for servers at all.
>
> So you're just plain wrong.
If you read what I wrote (it is quoted above), you will see that my argument was different. I also already answered the flawed "Jaguar wasn't designed for servers" argument.
> > > ARM is full of legacy crap as well. Not to mention the fact that an ARMv8 requires
> > > 3-4 different decoders. I know a few people who have had the pleasure of designing
> > > custom ARM cores, and according to them 'ARMv8 decode is just as terrible as x86'.
> >
> > By ARM64 I am referring to AArch64 exclusively. ARMv8 can be A or T and it includes AArch32 for legacy.
>
> Can you name a design which is ARM64 only with no 32-bit support? I can't.
> That means that decoders are needed for ARMv8, v7, and thumb.
The reason ARM implemented 64-bit as a separate ISA, instead of as an extension of the 32-bit one, is that it expects _future_ products to implement only the 64-bit ISA, eliminating the legacy penalty completely.
I don't know about K12, but since it is a server-exclusive product, I see no sense in providing support for 32-bit and Thumb.
I can tell you that one of the designs I mentioned to you before implements a full AArch64 decoder but only partial AArch32 support. This SoC cannot run a 32-bit OS, for instance.
> > The designs that I am commenting are pure AArch64 implementations,
> > legacy 32bit mode is not needed for HPC for instance.
>
> OK, so you only want to talk about HPC then?
My first post started by mentioning Xeon-like 90W SoCs for server/HPC. How do I have to explain that we will not see any of those in phones/tablets/laptops?
> > > There really isn't a significant x86 tax. Perhaps 5% for a reasonable
> > > core (obviously things are much worse for scalar cores).
> >
> > I have info from Intel that says otherwise and I know that he is painting in rose.
>
> I don't believe you. I know a number of architects (e.g., Greg Favor, Mike Fillipo) who have
> designed both x86 and ARM cores, and everyone says the same thing. ISA doesn't really matter
> for performance. x86 takes a bit more effort, but its more validation than anything else.
The data will not change because you don't believe it. I know both software and hardware engineers who will tell you that the ISA matters. The latest was Keller, who noted during AMD's core conference that "ARMv8 doesn't require the same instruction decoding hardware as an x86 processor, leaving more room to concentrate on performance". He even mentioned that his K12 core will have a "wider engine" than its x86 sister core.
Moreover, I am convinced that Filippo (not "Fillipo") knows that the A57 is about 10% faster when running the new ISA. Same architecture, same program, and 10% faster due only to the change of ISA on 64bit more than
> > According to him
> > legacy support already accounts for one-third of the energy of integer execution. This doesn't include
> > fetch-decode energy, which sums up to about 2/3. Moreover his numbers perpetuate the myth that x86 tax
> > in only in the decode: part of the energy associated to execution has a penalty due to the ISA.
>
> That's rubbish. Explain to me how executing a register-register ADD instruction is significantly more
> expensive for x86 than ARM. Seriously - try it. There's no difference for most instructions.
I wrote "integer execution", not "execution of a single integer instruction". The energy measured under the "integer execution" tag includes everything needed to execute the non-FP code that was not counted under the fetch-decode tag, the cache access tag, or the OoO/branching tag.
Of course, those numbers are for small cores. For cores twice as big, with doubled computation resources, the legacy penalty is reduced, but it still exists.
> ARM implementations tend to use uops and have plenty of complex instructions to deal with (e.g., LDM, STM).
>
> > > > According to Feldman an entirely custom server chip using the ARM architecture takes about 18 months
> > > > and about $30 million. By contrast, it takes three or four-year time frame and $300--400 million in
> > > > development costs required to build an x86-based server chip based on a new micro-architecture.
> > >
> > > Those numbers are suspect and also probably not comparing the right things. Much
> > > of the cost of a server design is in the cache, coherent interconnects, memory controller,
> > > power management, etc. which is necessary for any design, ARM or x86.
> >
> > Apparently you have not heard of AMD AMBIDEXTROUS strategy. Only the core
> > changes, the rest of the chip is the same up to the pin level.
>
> You tried arguing that ARM servers are cheaper to design than x86 servers based on Andrew
> Feldman's statements. My point is it's the same cost. The only thing that is different
> is that the ARM core is a bit simpler. All the 'uncore' is equally expensive for ARM
> and x86. A good L3 cache doesn't know anything about ARM or x86. It's a cache.
Indeed, all the 'uncore' is essentially the same for x86 and for ARM. The difference in cost (~10x) and in development time (~2x) comes from the ISA alone: ARM core vs. x86 core.
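As a rough check of those ratios, here is the arithmetic on the Feldman figures quoted earlier in this thread (about $30 million and 18 months for ARM, $300-400 million and three to four years for x86). This is only a sketch; the variable names are mine and the inputs are the quoted estimates, not measured data.

```python
# Feldman's figures as quoted earlier in the thread (estimates, not measurements).
arm_cost_musd = 30             # custom ARM server chip, millions of USD
x86_cost_musd = (300, 400)     # new x86 server chip, low and high estimate
arm_months = 18                # ARM development time
x86_months = (3 * 12, 4 * 12)  # three to four years, in months

# Ratio of x86 to ARM for the low and high x86 estimates.
cost_ratio = tuple(c / arm_cost_musd for c in x86_cost_musd)
time_ratio = tuple(m / arm_months for m in x86_months)

print("cost ratio (x86/ARM): %.1fx to %.1fx" % cost_ratio)
print("time ratio (x86/ARM): %.1fx to %.1fx" % time_ratio)
```

The low end of each range gives the ~10x and ~2x mentioned above; the high end is somewhat larger.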
> How does AMD's strategy mean anything about the relative cost of ARM and x86 servers?
> Doing a server using an already designed core (e.g., a licensed core, or one that was
> already built for desktop/notebook) is cheaper than designing a core from scratch.
AMD's whole recent switch from the old x86-exclusive strategy to the new ARM/x86 ambidextrous one is the result of the advantages of the ARM ISA and its surrounding ecosystem, as the AMD server head has explained plenty of times.
> > Above numbers are credible. They are the reason why so many companies are doing competitive server/HPC
> > designs. They are the reason that K12 core comes first and the zen core comes latter.
>
> How do you know they are competitive? Have you seen performance for ARM-based servers
> in general availability? Are those numbers competitive with Ivy Bridge-EP?
Yes, I have seen numbers, and yes, they are competitive against IB-EP and against Haswell-EP.