By: Megol (golem960.delete@this.gmail.com), August 9, 2014 4:22 am
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on August 9, 2014 12:29 am wrote:
> Megol (golem960.delete@this.gmail.com) on August 8, 2014 11:23 am wrote:
> > juanrga (nospam.delete@this.juanrga.com) on August 8, 2014 10:49 am wrote:
> > > anon (anon.delete@this.anon.com) on August 6, 2014 7:54 pm wrote:
>
> > > > I have also heard from many people (it's possible this is just an uninformed 'echo chamber effect',
> > > > but I think there is some merit to the idea) that x86 cores take significantly more design skill
> > > > than an equivalent ARM core. Whether this is due to compatibility, or decoders, or necessity of
> > > > more capable memory pipline and caches, I don't know, but it seems to also be an x86 tax.
> > >
> > > E.g. a x86 decoder is more difficult to implement than an ARM64 decoder, because the former has to match
> > > instructions of variable length.
> >
> > True. But the way to handle this is well known nowadays,
>
> That's a complete non-point, and it does not mean that no disadvantage exists. You could just
> as well say that Intel "handed this well" with the Pentium or 386, for some values of "well".
>
> Atoms are 2 wide, even the SMT Atom is only 2 wide! While ARM went to 3 wide rather easily.
First: you can't directly compare decode width between a CISC ISA and a RISC one. Second: 2 wide is something of a sweet spot - going wider expends resources for diminishing returns.
The AMD Bobcat and its relatives are also officially 2-wide decode, yet they can do the work of 4 RISC instructions per cycle: two integer operations and two memory operations.
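As a rough illustration, take a simple reduction loop in C (the instruction sequences in the comments are what a compiler could plausibly emit, not verified output):

    /* One x86 load-op instruction vs. a load/store-RISC pair.
       The assembly in the comments is assumed/plausible, for illustration only. */
    long sum_array(const long *a, long n)
    {
        long sum = 0;
        for (long i = 0; i < n; i++) {
            /* x86-64, one macro-op (load folded into the add):
                   add rax, [rdi + rcx*8]
               load/store RISC, two instructions:
                   ldr x3, [x0, x2, lsl #3]
                   add x1, x1, x3 */
            sum += a[i];
        }
        return sum;
    }

So two load-op instructions per cycle through a "2-wide" x86 decoder correspond to roughly four instructions on a load/store machine.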
> > one way is using massive parallel
> > length decoding, another is to use predecode data and tag each byte. There have been
> > arguments that the later technique can scale up to 8 instructions decoded/clock with
> > most complexity being those things a RISC also need (tracking dependencies++).
>
> The big Intel cores use significant complexity to tackle the problem and they're stuck
> at 4. POWER has reached 8 without problems (with almost certainly better throughput/watt
> on its target workloads). Not that this is attributable to decoder alone or x86 tax
> at all necessarily, but just to head off any claim of it being a furnace.
Intel hasn't tried to extend decoding beyond 4 instructions/cycle, so I don't know what you mean here. Instead they have compensated for the decode bandwidth with the µop cache, which has other advantages too, including lower power consumption for hot loops.
The idea that going 8 wide isn't a problem is funny given the attempts made in the past and the well-known scaling problems in this area.
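To make the predecode/tag-each-byte approach I mentioned above a bit more concrete, here is a minimal C sketch (the names and the toy length rule are made up; real x86 length decoding is far messier): the length decode runs once when the line is filled, and each byte gets a start-of-instruction bit, so the wide pick/decode logic only scans tag bits instead of redoing a serial variable-length parse on every fetch.

    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint8_t bytes[64];
        uint8_t start[64];   /* 1 = this byte begins an instruction */
    } predecoded_line;

    /* stand-in length decoder: toy rule where the low 2 bits of the first
       byte give length-1; the real x86 rules are much more involved */
    static unsigned instr_len(const uint8_t *p)
    {
        return (*p & 0x3) + 1;
    }

    /* done once at cache-fill time, off the fetch critical path */
    static void predecode(predecoded_line *line)
    {
        memset(line->start, 0, sizeof line->start);
        for (unsigned i = 0; i < 64; ) {
            line->start[i] = 1;
            i += instr_len(&line->bytes[i]);
        }
    }

    /* at fetch time the decoders just pick the next N set start bits */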
> I don't know what you mean by "tracking dependencies++", but there is
> no indication that POWER8 uses a uop cache, so you're simply wrong.
What? When decoding a non-explicit ISA one has to track dependencies between instruction slots. The complexity of doing that scales as n^2 for normal techniques. The difference between an 8-issue x86 and an 8-issue RISC is that the x86 has to split instructions into "lanes"; after that the complexity is comparable.
It has nothing to do with µop caches.
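A minimal sketch of what I mean by tracking dependencies (toy C model, made-up names): for an n-wide group, every slot's sources have to be compared against the destinations of all older slots in the group, which is where the ~n^2 comparator cost comes from - and that cost is paid by any wide renamed machine, RISC or x86 alike.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint8_t dst;      /* architectural destination register */
        uint8_t src[2];   /* architectural source registers */
    } slot;

    /* mark which sources must be forwarded from an older slot in the group */
    void find_intragroup_deps(const slot *g, int n, bool dep[][2])
    {
        for (int j = 0; j < n; j++)
            for (int s = 0; s < 2; s++) {
                dep[j][s] = false;
                for (int i = 0; i < j; i++)       /* compare against all older slots */
                    if (g[i].dst == g[j].src[s])
                        dep[j][s] = true;
            }
    }

The only extra step the x86 front end needs before this point is steering the variable-length instructions into fixed lanes; from there on the complexity is comparable.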
> >
> > >Also the x86 ISA is full of legacy instructions, which have to be implemented
> > > in hardware and then verified/tested which increases development costs and time of development.
> >
> > Wrong. Legacy instructions need some hardware, true. But most of the functionality
> > is implemented in microcode instead of adding complex hardware.
> > Now there are some quirks in the x86 ISA that does waste power like handling
> > of shift by zero, calculating the auxilary flag (nibble carry) etc. But those
> > are far from the most power consuming parts of an OoO processor core.
> >
> > > According to Feldman an entirely custom server chip using the ARM architecture takes about 18 months
> > > and about $30 million. By contrast, it takes three or four-year time frame and $300--400 million in
> > > development costs required to build an x86-based server chip based on a new micro-architecture.
> >
> > Now that's 100% true. X86 is a complex beast to implement
> > and much of the complexities aren't really documented.
> > But those undocumented things are used, knowingly or otherwise, and have to be supported.
> >
>
> Well, obviously they're well documented inside Intel and AMD nowadays, and that's all that really matters.
> That's not the cost of implementation. The cost is in the jet engines required to make the pig fly.
Yeah... The problem with that argument is that there are no jet engines. There is nothing special that only Intel can do, done solely to make the x86 ISA competitive.
Look at the reality: x86 processors are among the highest performing and among the cheapest relative to alternatives of comparable performance. Yes, the ISA has overheads, but those are measured in percentages nowadays and can be more than compensated for by the development budgets and larger chips the mass market can afford.