By: Megol (golem960.delete@this.gmail.com), August 12, 2014 5:48 am
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on August 11, 2014 8:46 pm wrote:
> Megol (golem960.delete@this.gmail.com) on August 11, 2014 8:30 am wrote:
> > anon (anon.delete@this.anon.com) on August 11, 2014 3:56 am wrote:
> > > Michael S (already5chosen.delete@this.yahoo.com) on August 10, 2014 11:30 pm wrote:
> > > > anon (anon.delete@this.anon.com) on August 10, 2014 5:27 pm wrote:
> > > > > Michael S (already5chosen.delete@this.yahoo.com) on August 10, 2014 3:11 am wrote:
> > > > > > anon (anon.delete@this.anon.com) on August 9, 2014 12:29 am wrote:
> > > > > > >
> > > > > > > The big Intel cores use significant complexity to tackle the problem and they're stuck
> > > > > > > at 4. POWER has reached 8 without problems (with almost certainly better throughput/watt
> > > > > > > on its target workloads).
> > > > > >
> > > > > > "almost certainly" is way too strong a statement. It's possible, yes. But so far we have zero evidence.
> > > > >
> > > > > We have non-zero evidence. Not complete, but there is evidence.
> > > > >
> > > >
> > > > If I am not mistaken, all we have now are very impressive 4x6-core Power8 SAP SD 2-tier scores that still
> > > > lose in absolute numbers to 4x15-core Intel and approximately match die-for-die 16x16-core Fujitsu.
> > > > We don't know which system between the three consumes less power under load, not even approximately.
> > >
> > > Well, we have some power specifications from IBM and Intel
> > > too, although granted if looking at system power,
> > > or even just taking into account the external memory controllers on POWER8, we don't know for sure.
> > >
> > > >
> > > > > >
> > > > > > > Not that this is attributable to decoder alone or x86 tax
> > > > > > > at all necessarily, but just to head off any claim of it being a furnace.
> > > > > > >
> > > > > > > I don't know what you mean by "tracking dependencies++", but there is
> > > > > > > no indication that POWER8 uses a uop cache, so you're simply wrong.
> > > > > > >
> > > > > >
> > > > > > Tracking dependencies within a group of instructions that
> > > > > > are renamed in parallel. Conventional wisdom says that
> > > > > > it has complexity of O(width^2). Maybe there was an algorithmic breakthrough in this area, I don't know...
> > > > >
> > > > > That has nothing to do with decoding stage, however.
> > > > >
> > > >
> > > > The context was practical limits of the width of in-order front end of OoO cores.
> > >
> > > No, it was, very specifically, the decoding cost. My comment was the decoding
> > > cost was higher, and the response was something along the lines of "not really
> > > because all CPUs have to track dependencies anyway", which is just stupid.
> > >
> > > Decoding cost of x86 is higher than most other ISAs, particularly parallel decoding.
> >
> > Then you again are 100% wrong.
> > The problem: determine instruction lengths so that work can be divided to separate decoder "lanes".
> > The solution: use predecode data stored in the instruction cache, 1 bit per stored byte is enough
> > to make 8 wide or more possible. More predecode bits can help in lowering scaling costs.
>
> Oh wow you solved everything. Lucky x86 has zero other decoding difficulties except for variable lengths.
>
Yes, as far as I know. But I guess you are willing to tell us what all the other problems are?
BTW: Things are only a problem if they hinder scaling when running actual code. If they don't, I'm still correct.
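
To make the predecode point concrete, here is a minimal sketch of the idea, not a description of any shipping core. It assumes one mark bit per instruction-cache byte, set where an instruction starts; the instruction lengths and the helper names `predecode` and `assign_lanes` are made up for illustration.

```python
def predecode(lengths):
    """Build one mark bit per byte: 1 where an instruction starts, else 0.

    In hardware this would be done once, when the line is filled into
    the instruction cache (assumption for this sketch).
    """
    marks = []
    for n in lengths:
        marks.extend([1] + [0] * (n - 1))
    return marks


def assign_lanes(marks, width=8):
    """Return (start, end) byte ranges for up to `width` decoder lanes.

    With the mark bits available, each lane's start is found without
    decoding any earlier instruction in the window.
    """
    starts = [i for i, bit in enumerate(marks) if bit]
    bounds = starts + [len(marks)]
    return [(bounds[k], bounds[k + 1]) for k in range(min(width, len(starts)))]


# Hypothetical variable-length instruction stream (lengths in bytes).
lengths = [1, 3, 2, 7, 1, 5, 4, 2, 6]
marks = predecode(lengths)
lanes = assign_lanes(marks, width=8)
for lane, (start, end) in enumerate(lanes):
    print(f"lane {lane}: bytes {start}..{end - 1}")
```

The sequential length computation happens once at fill time; at fetch time the lanes only need the positions of the set bits, which is why the width can grow without serial length decoding.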