By: Megol (golem960.delete@this.gmail.com), August 11, 2014 8:10 am
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on August 9, 2014 5:03 am wrote:
> Megol (golem960.delete@this.gmail.com) on August 9, 2014 4:22 am wrote:
> > anon (anon.delete@this.anon.com) on August 9, 2014 12:29 am wrote:
> > > Megol (golem960.delete@this.gmail.com) on August 8, 2014 11:23 am wrote:
> > > > juanrga (nospam.delete@this.juanrga.com) on August 8, 2014 10:49 am wrote:
> > > > > anon (anon.delete@this.anon.com) on August 6, 2014 7:54 pm wrote:
> > >
> > > > > > I have also heard from many people (it's possible this is just an uninformed 'echo chamber effect',
> > > > > > but I think there is some merit to the idea) that x86 cores take significantly more design skill
> > > > > > than an equivalent ARM core. Whether this is due to compatibility, or decoders, or necessity of
> > > > > > more capable memory pipeline and caches, I don't know, but it seems to also be an x86 tax.
> > > > >
> > > > > E.g. an x86 decoder is more difficult to implement than an ARM64 decoder, because the former has to match
> > > > > instructions of variable length.
> > > >
> > > > True. But the way to handle this is well known nowadays,
> > >
> > > That's a complete non-point, and it does not mean that no disadvantage exists. You could just
> > > as well say that Intel "handled this well" with the Pentium or 386, for some values of "well".
> > >
> > > Atoms are 2 wide, even the SMT Atom is only 2 wide! While ARM went to 3 wide rather easily.
> >
> > First: you can't directly compare the CISC ISA width with the RISC one.
>
> First: you can. Easily. Because you know that ARM dynamic instruction count is quite comparable
> to x86, and even the paper being discussed shows that (except for some exceptions that are due
> to x86 microcoded instructions like the transcendentals on SPECfp, or poor code generation).
>
> Dynamic instruction count to solve a given problem is literally the final word in semantic expressiveness
> of instructions. If the ARM decoder can take fewer cycles to decode a compiled program than x86,
> then it objectively has the better throughput for that case. (obviously there are many more variables
> for "goodness", but simply talking about decode width in this paragraph).
Of course. However, in practice that advantage gets dwarfed by other design choices.
Add a somewhat better branch predictor and suddenly multi-cycle decode isn't a problem.
Add some mechanism that avoids re-decoding hot parts of programs, and the extra decode latency on mispredicted branches isn't a problem either.
And those things benefit all code in other ways.
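To make the "avoid re-decoding hot parts" idea concrete, here is a toy sketch of a decoded-instruction cache. The names and cycle costs are invented for illustration; this is not a model of any real core:

```python
# Toy sketch of a decoded-instruction cache: hot code is decoded once,
# then fetched in already-decoded form, hiding multi-cycle decode latency.
# Purely illustrative; names and cycle costs are made up.

DECODE_COST = 3   # assumed cycles to fully decode a fetch block
HIT_COST = 1      # assumed cycles to fetch an already-decoded block

class DecodedCache:
    def __init__(self):
        self.lines = {}     # block address -> decoded micro-ops
        self.cycles = 0

    def fetch(self, addr, raw_block):
        if addr in self.lines:          # hot path: skip the decoders
            self.cycles += HIT_COST
        else:                           # cold path: pay full decode once
            self.cycles += DECODE_COST
            self.lines[addr] = ["uop(" + b + ")" for b in raw_block]
        return self.lines[addr]

cache = DecodedCache()
loop = ["add", "cmp", "jnz"]
for _ in range(100):                    # a hot loop re-fetches one block
    cache.fetch(0x400, loop)
# Decode is paid once; the remaining 99 iterations hit the cache.
print(cache.cycles)   # 3 + 99*1 = 102, versus 300 with no cache
```

The point of the sketch: once the hot loop is cached in decoded form, the width and latency of the decoders stop mattering for that loop.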
> > Second: 2 wide
> > is sort of a sweet spot - going wider is expending resources with diminishing returns.
>
> Of course, and again, a non-statement. 2-wide *is* a sweet
> spot at that level... for x86. Because decoding is hard.
No. Because it is the sweet spot. Go read some papers on scaling of processor structures.
> >
> > The AMD Bobcat and relatives are also officially 2 wide decode. But they can do the same
> > work as 4 RISC instructions: two integer operations and two memory operations.
>
> Look, obviously just taking datapoints and making assumptions/extrapolations from there is never going to give
> conclusive evidence one way or the other. That said, if you think x86 decoding is _not_ a notable disadvantage
> for the ISA, then you're just completely out to lunch, and against what any credible person I've ever seen
> (not me, an anonymous internet poster is not credible, but former Intel engineers for example, are).
Of course it is a disadvantage - can you point out where I said (wrote) the opposite?
But for a realistic design the disadvantage isn't notable: the hard problems are in other parts of the design, and those problems are the same for any traditional OoO processor.
Bobcat can sustain two integer operations and two memory operations per clock.
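To illustrate that point: a load-op x86 instruction carries both a memory access and an ALU operation, so two decoded instructions can feed four RISC-style operations. A schematic sketch - the expansion rule here is invented for illustration, not Bobcat's actual instruction cracking:

```python
# Schematic only: one x86 load-op instruction bundles a memory access
# and an ALU operation, so 2-wide CISC decode can supply as much work
# per cycle as 4 RISC-style operations. Not a model of real hardware.

def expand(x86_insn):
    """Split a load-op macro-instruction into RISC-like micro-operations."""
    op, dst, src = x86_insn
    if src.startswith("["):             # memory operand -> load + ALU op
        return [("load", "tmp", src), (op, dst, "tmp")]
    return [(op, dst, src)]             # register operand -> single op

# Two x86 instructions decoded in one cycle (2-wide decode)...
decoded = [("add", "eax", "[rbx]"), ("sub", "ecx", "[rdx]")]
ops = [u for insn in decoded for u in expand(insn)]
# ...yield four RISC-style operations: 2 loads + 2 ALU ops.
print(len(ops))   # 4
```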
> > > > one way is using massive parallel
> > > > length decoding, another is to use predecode data and tag each byte. There have been
> > > > arguments that the latter technique can scale up to 8 instructions decoded/clock with
> > > > most complexity being those things a RISC also need (tracking dependencies++).
> > >
> > > The big Intel cores use significant complexity to tackle the problem and they're stuck
> > > at 4. POWER has reached 8 without problems (with almost certainly better throughput/watt
> > > on its target workloads). Not that this is attributable to decoder alone or x86 tax
> > > at all necessarily, but just to head off any claim of it being a furnace.
> >
> > Intel hasn't tried to extend decoding beyond 4 instructions/cycle
>
> How do you know?
Because they haven't.
> > so I don't know what
> > you mean here. They have temporarily compensated for the decode bandwidth with the µop cache,
> > which has other advantages too, including lower power consumption for hot loops.
>
> For x86 decoders.
No.
> > The idea that going 8 wide isn't a problem is funny given the attempts
> > in the past and the well known scaling problems in the area.
>
> For x86 decoders.
*sigh*
Go read some studies in the area - strangely they tend to use RISC pipelines...
> >
> > > I don't know what you mean by "tracking dependencies++", but there is
> > > no indication that POWER8 uses a uop cache, so you're simply wrong.
> >
> > What? When decoding a non-explicit ISA one has to track dependencies between instruction slots.
>
> Rubbish. The only dependencies in instruction decoding are for x86, because variable-length instructions
> mean that the position of each instruction depends on the encoding of previous instructions. Fixed width has no such
> issues. Dual width has obviously dependent pairs, but that's obviously easier to scale up.
It is? Go earn some money then by describing how to do it.
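The dependency being argued about can be made concrete: with variable-length encoding, instruction i+1's start address depends on instruction i's length, so naive decode is a serial chain. Predecoded start-bits pay that chain once and move it off the critical path. A toy sketch with invented instruction lengths:

```python
# Toy illustration of the variable-length decode dependency: finding where
# instruction i+1 starts requires knowing instruction i's length, so naive
# decode is serial. Predecoded start-bits remove that chain from the
# fetch-decode critical path. Lengths and the byte stream are invented.

def insn_length(first_byte):
    """Pretend length decoder: the byte value picks a 1-4 byte length."""
    return (first_byte % 4) + 1

def serial_decode(code):
    """Walk the bytes serially - each step depends on the previous length."""
    starts, pc = [], 0
    while pc < len(code):
        starts.append(pc)
        pc += insn_length(code[pc])
    return starts

def predecode(code):
    """One-time pass tagging each byte as an instruction start or not."""
    marks = [False] * len(code)
    pc = 0
    while pc < len(code):
        marks[pc] = True
        pc += insn_length(code[pc])
    return marks

code = bytes([7, 2, 9, 9, 0, 5, 1, 3, 3, 6])
marks = predecode(code)
# With start-bits cached, parallel decoders just pick the marked bytes;
# the serial length chain is paid once, not on every fetch.
parallel_starts = [i for i, m in enumerate(marks) if m]
print(serial_decode(code) == parallel_starts)   # True
```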
> > The complexity
> > of doing that scales n^2 for normal techniques. The difference between an 8-issue x86 and an 8-issue RISC
> > is that the x86 has to split instructions into "lanes"; after that the complexity is comparable.
> > It has nothing to do with µop caches.
> >
> > > >
> > > > >Also the x86 ISA is full of legacy instructions, which have to be implemented
> > > > > in hardware and then verified/tested which increases development costs and time of development.
> > > >
> > > > Wrong. Legacy instructions need some hardware, true. But most of the functionality
> > > > is implemented in microcode instead of adding complex hardware.
> > > > Now there are some quirks in the x86 ISA that do waste power, like handling
> > > > of shift by zero, calculating the auxiliary flag (nibble carry), etc. But those
> > > > are far from the most power consuming parts of an OoO processor core.
> > > >
> > > > > According to Feldman an entirely custom server chip using the ARM architecture takes about 18 months
> > > > > and about $30 million. By contrast, it takes a three- or four-year time frame and $300-400 million in
> > > > > development costs to build an x86-based server chip based on a new micro-architecture.
> > > >
> > > > Now that's 100% true. X86 is a complex beast to implement
> > > > and many of its complexities aren't really documented.
> > > > But those undocumented things are used, knowingly or otherwise, and have to be supported.
> > > >
> > >
> > > Well, obviously they're well documented inside Intel and AMD nowadays, and that's all that really matters.
> > > That's not the cost of implementation. The cost is in the jet engines required to make the pig fly.
> >
> > Yeah... The problem for that argument is that there are no jet engines. There are no special
> > things that only Intel can do, done only to make the x86 ISA competitive.
>
> Yes there is. Big decoders, high performance microcode modes, complex decoded instruction caches,
> big and capable store forwarding, memory disambiguation, stack tracking, memory speculation, etc.
>
> Many of these things are good regardless of ISA, but also other ISAs did not have to implement them. POWER6
> had no store forwarding, POWER7 (I believe) did not have memory disambiguation, and neither had uop caches.
Look I don't know how to respond to this. Do you really think what you wrote above is true?
> > Look at the reality: x86 processors are among the highest performing, with the lowest cost relative to
> > comparably performing alternatives.
>
> I would ask you to do the same thing: x86 processors are among the highest performing in the space they
> have been targeting for the past several decades. From somewhere around Atom space all the way down
> to ~10,000 gate microcontrollers, x86 is anywhere from uncompetitive to completely impossible.
>
> > If the ISA has overheads
>
> Oh, so it does have overheads now?
Who claimed otherwise? Most ISAs have overheads - even the Alpha had some (minor) warts.
> > but those are in percentages nowadays
>
> Evidence?
Go to news://comp.arch. Search for discussions about x86 decode complexity and overheads. Get opinions and figures from people who have implemented both x86 and RISC designs.
> > which can be
> > more than compensated for by the development costs and larger chips the mass market can afford.
>
> Evidence? Certainly for smartphone space, the market says you're wrong.
No, the market says that smartphone makers want to design their own SoCs and pay very low prices for their processor designs. Intel has traditionally not worked like that.