By: anon (anon.delete@this.anon.com), August 9, 2014 4:03 am
Room: Moderated Discussions
Megol (golem960.delete@this.gmail.com) on August 9, 2014 4:22 am wrote:
> anon (anon.delete@this.anon.com) on August 9, 2014 12:29 am wrote:
> > Megol (golem960.delete@this.gmail.com) on August 8, 2014 11:23 am wrote:
> > > juanrga (nospam.delete@this.juanrga.com) on August 8, 2014 10:49 am wrote:
> > > > anon (anon.delete@this.anon.com) on August 6, 2014 7:54 pm wrote:
> >
> > > > > I have also heard from many people (it's possible this is just an uninformed 'echo chamber effect',
> > > > > but I think there is some merit to the idea) that x86 cores take significantly more design skill
> > > > > than an equivalent ARM core. Whether this is due to compatibility, or decoders, or necessity of
> > > > > more capable memory pipline and caches, I don't know, but it seems to also be an x86 tax.
> > > >
> > > > E.g. a x86 decoder is more difficult to implement than an ARM64 decoder, because the former has to match
> > > > instructions of variable length.
> > >
> > > True. But the way to handle this is well known nowadays,
> >
> > That's a complete non-point, and it does not mean that no disadvantage exists. You could just
> > as well say that Intel "handed this well" with the Pentium or 386, for some values of "well".
> >
> > Atoms are 2 wide, even the SMT Atom is only 2 wide! While ARM went to 3 wide rather easily.
>
> First: you can't directly compare the CISC ISA width with the RISC one.
First: you can. Easily. Because you know that ARM dynamic instruction count is quite comparable to x86, and even the paper being discussed shows that (except for some exceptions that are due to x86 microcoded instructions like the transcendentals on SPECfp, or poor code generation).
Dynamic instruction count to solve a given problem is literally the final word in semantic expressiveness of instructions. If the ARM decoder can take fewer cycles to decode a compiled program than x86, then it objectively has the better throughput for that case. (obviously there are many more variables for "goodness", but simply talking about decode width in this paragraph).
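To make that arithmetic concrete, here's a toy sketch (illustrative numbers only, not measurements from any real chip) of how dynamic instruction count and decode width together bound front-end throughput:

```python
# Toy model: minimum cycles a decoder needs to feed a program,
# ignoring stalls, fusion, and uop caches. Numbers are illustrative.

def min_decode_cycles(dynamic_insn_count, decode_width):
    """Lower bound on decode cycles for a given instruction stream."""
    return -(-dynamic_insn_count // decode_width)  # ceiling division

# Suppose a workload compiles to roughly the same dynamic instruction
# count on both ISAs (as the paper under discussion suggests):
insns = 1_000_000
x86_cycles = min_decode_cycles(insns, 2)  # 2-wide Atom-class decoder
arm_cycles = min_decode_cycles(insns, 3)  # 3-wide ARM decoder

print(x86_cycles)  # 500000
print(arm_cycles)  # 333334
```

With comparable dynamic instruction counts, the wider decoder wins the decode-bandwidth comparison directly; there's no CISC "density" factor left to hide behind.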
> Second: 2 wide
> is sort of a sweet spot - going wider is expending resources with diminishing returns.
Of course, and again, a non-statement. 2-wide *is* a sweet spot at that level... for x86. Because decoding is hard.
>
> The AMD Bobcat and relatives are also officially 2 wide decode. But it can do the same
> work as 4 RISC instructions, two integer operations and two memory operations.
Look, obviously just taking data points and extrapolating from there is never going to give conclusive evidence one way or the other. That said, if you think x86 decoding is _not_ a notable disadvantage for the ISA, then you're just completely out to lunch, and at odds with every credible person I've seen speak on it (not me, an anonymous internet poster is not credible, but former Intel engineers, for example, are).
>
> > > one way is using massive parallel
> > > length decoding, another is to use predecode data and tag each byte. There have been
> > > arguments that the later technique can scale up to 8 instructions decoded/clock with
> > > most complexity being those things a RISC also need (tracking dependencies++).
> >
> > The big Intel cores use significant complexity to tackle the problem and they're stuck
> > at 4. POWER has reached 8 without problems (with almost certainly better throughput/watt
> > on its target workloads). Not that this is attributable to decoder alone or x86 tax
> > at all necessarily, but just to head off any claim of it being a furnace.
>
> Intel haven't tried to extend decoding beyond 4 instructions/cycle
How do you know?
> so I don't know what
> you mean here. They have temporarily compensated the decode bandwidth with the µop cache
> that have other advantages too inclusive lower power consumption for hot loops.
For x86 decoders.
> The idea that going 8 wide isn't a problem is funny given the attempts
> in the past and the well known scaling problems in the area.
For x86 decoders.
>
> > I don't know what you mean by "tracking dependencies++", but there is
> > no indication that POWER8 uses a uop cache, so you're simply wrong.
>
> What? When decoding a non-explicit ISA one have to track dependencies between instruction slots.
Rubbish. The only dependencies in instruction decoding are for x86, because variable-length instructions mean the encoding of an instruction depends on the encoding of the previous ones. Fixed width has no such issues. Dual width obviously has dependent pairs, but that's far easier to scale up.
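The serial dependency is easy to see in a toy sketch. The "ISA" below is hypothetical (the first byte of each instruction encodes its length), purely to illustrate the difference:

```python
# Toy illustration of the length-decode dependency. Hypothetical ISA:
# the first byte of each instruction encodes that instruction's length.

def variable_length_boundaries(stream):
    """Serial: each start position depends on the previous length."""
    starts, pos = [], 0
    while pos < len(stream):
        starts.append(pos)
        pos += stream[pos]  # must decode a length to find the next start
    return starts

def fixed_width_boundaries(stream, width=4):
    """Parallel: every boundary is known from the byte index alone."""
    return list(range(0, len(stream), width))

# Variable-length stream: lengths 2, 3, 1, 2 encoded in first bytes.
var_stream = bytes([2, 0, 3, 0, 0, 1, 2, 0])
print(variable_length_boundaries(var_stream))  # [0, 2, 5, 6]

# Fixed-width stream: eight bytes of 4-byte instructions.
print(fixed_width_boundaries(bytes(8)))  # [0, 4]
```

Real hardware attacks the serial chain with speculative decode at every byte offset or predecode bits in the cache, but that's exactly the extra work a fixed-width decoder never has to do.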
> The complexity
> of doing that scales n^2 for normal techniques. The difference between a 8 issue x86 and an 8 issue RISC
> is that the x86 have to split instructions into "lanes", after that the complexity is comparable.
> It have nothing to do with µop caches.
>
> > >
> > > >Also the x86 ISA is full of legacy instructions, which have to be implemented
> > > > in hardware and then verified/tested which increases development costs and time of development.
> > >
> > > Wrong. Legacy instructions need some hardware, true. But most of the functionality
> > > is implemented in microcode instead of adding complex hardware.
> > > Now there are some quirks in the x86 ISA that does waste power like handling
> > > of shift by zero, calculating the auxilary flag (nibble carry) etc. But those
> > > are far from the most power consuming parts of an OoO processor core.
> > >
> > > > According to Feldman an entirely custom server chip using the ARM architecture takes about 18 months
> > > > and about $30 million. By contrast, it takes three or four-year time frame and $300--400 million in
> > > > development costs required to build an x86-based server chip based on a new micro-architecture.
> > >
> > > Now that's 100% true. X86 is a complex beast to implement
> > > and much of the complexities aren't really documented.
> > > But those undocumented things are used, knowingly or otherwise, and have to be supported.
> > >
> >
> > Well, obviously they're well documented inside Intel and AMD nowadays, and that's all that really matters.
> > That's not the cost of implementation. The cost is in the jet engines required to make the pig fly.
>
> Yeah... The problem for that argument that there are no jet engines. There are no special
> things that only Intel can do and only does it to make the x86 ISA competitive.
Yes, there are: big decoders, high-performance microcode modes, complex decoded-instruction caches, big and capable store forwarding, memory disambiguation, stack tracking, memory speculation, etc.
Many of these things are good regardless of ISA, but other ISAs did not have to implement them to be competitive. POWER6 had no store forwarding, POWER7 (I believe) had no memory disambiguation, and neither had a uop cache.
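For anyone unfamiliar with the first of those terms, here's a toy sketch of what store-to-load forwarding means (a simplification, not how any real store queue is built):

```python
# Toy sketch of store-to-load forwarding: a load that matches the
# address of a still-pending (not yet retired) store takes its data
# from the store queue instead of waiting for the write to reach cache.

class StoreQueue:
    def __init__(self):
        self.pending = []  # (addr, value) pairs, oldest first

    def store(self, addr, value):
        self.pending.append((addr, value))

    def load(self, addr, memory):
        # Search youngest-to-oldest for a matching pending store.
        for a, v in reversed(self.pending):
            if a == addr:
                return v  # forwarded: no cache access needed
        return memory.get(addr, 0)  # otherwise read memory

mem = {0x10: 7}
sq = StoreQueue()
sq.store(0x10, 42)         # store not yet visible in memory
print(sq.load(0x10, mem))  # 42, forwarded from the store queue
print(sq.load(0x20, mem))  # 0, read from memory
```

Without this, every load behind a pending store to the same address stalls until the store drains, which is exactly why POWER6's lack of it hurt.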
> Look at the reality: x86 processors are among the highest performing with the lowest cost to comparative
> performance alternatives.
I would ask you to do the same thing: x86 processors are among the highest performing in the spaces they have targeted for the past several decades. From somewhere around the Atom space all the way down to ~10,000-gate microcontrollers, x86 is anywhere from uncompetitive to completely impossible.
> It the ISA has overheads
Oh, so it does have overheads now?
> but those are in percentages nowadays
Evidence?
> which can be
> more than compensated for by the development costs and larger chips the mass market can afford.
Evidence? Certainly for smartphone space, the market says you're wrong.