By: Michael S (already5chosen.delete@this.yahoo.com), August 11, 2014 5:23 am
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on August 11, 2014 3:56 am wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on August 10, 2014 11:30 pm wrote:
> > anon (anon.delete@this.anon.com) on August 10, 2014 5:27 pm wrote:
> > > Michael S (already5chosen.delete@this.yahoo.com) on August 10, 2014 3:11 am wrote:
> > > > anon (anon.delete@this.anon.com) on August 9, 2014 12:29 am wrote:
> > > > >
> > > > > The big Intel cores use significant complexity to tackle the problem and they're stuck
> > > > > at 4. POWER has reached 8 without problems (with almost certainly better throughput/watt
> > > > > on its target workloads).
> > > >
> > > > "almost certainly" is way to strong a statement. It's possible, yes. But so far we have zero evidence.
> > >
> > > We have non-zero evidence. Not complete, but there is evidence.
> > >
> >
> > If I am not mistaken, all we have now are very impressive 4x6-core Power8 SAP SD 2-tier scores that still
> > lose in absolute numbers to 4x15-core Intel and approximately match die-for-die 16x16-core Fujitsu.
> > We don't know which of the three systems consumes the least power under load, not even approximately.
>
> Well, we have some power specifications from IBM and Intel too, although granted if looking at system power,
Do we really have power specifications from IBM? Including 6-core parts?
As to Intel, we have TDP, but we also know from experience that Intel CPUs tend to approach and sometimes exceed TDP when running high-IPC workloads, especially SIMD-heavy ones. On the other hand, when Intel CPUs run low-IPC integer-only code they tend to stay well below TDP even when the CPU load monitor shows 100% all the time.
One of the latest SPECpower_ssj2008 submissions is a good example:
The whole 4xE5-4650 v2 server, with memory and disks (not heavily used in this particular benchmark, but still...), network card, fans, and a not-100%-efficient power supply, consumes 374 Watts at full load, i.e. less than the 380 W combined TDP of the CPUs alone! And that's while running SPECpower_ssj2008, which has significantly higher IPC than SAP SD 2-tier.
http://www.spec.org/power_ssj2008/results/res2014q2/power_ssj2008-20140401-00654.html
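As a back-of-envelope check, here is a rough sketch in Python. Only the 374 W wall-power figure comes from the submission above; the 95 W per-socket TDP follows from the 380 W total mentioned earlier, and the PSU efficiency is purely my own guess:

# Sanity check of the numbers above. Only MEASURED_WALL_POWER_W is taken from
# the SPECpower_ssj2008 submission; the PSU efficiency is an assumption.
SOCKETS = 4
TDP_PER_SOCKET_W = 95.0          # rated TDP of one E5-4650 v2 (380 W / 4 sockets)
MEASURED_WALL_POWER_W = 374.0    # whole system at 100% target load
ASSUMED_PSU_EFFICIENCY = 0.90    # guess; the real figure is not broken out in the report

total_cpu_tdp = SOCKETS * TDP_PER_SOCKET_W                         # 380 W
dc_side_estimate = MEASURED_WALL_POWER_W * ASSUMED_PSU_EFFICIENCY  # ~337 W for everything

print("Combined CPU TDP:        %.0f W" % total_cpu_tdp)
print("Measured wall power:     %.0f W" % MEASURED_WALL_POWER_W)
print("Estimated DC-side power: %.0f W (CPUs + memory + disks + fans + NIC)" % dc_side_estimate)

Even before subtracting memory, disks, fans and the NIC, the whole box draws less at the wall than the combined TDP of the four CPUs, so the CPUs themselves are clearly running well below TDP on this workload.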
> or even just taking into account the external memory controllers on POWER8, we don't know for sure.
In general, Power8 has a higher-throughput memory subsystem that, when running at full gear, certainly consumes several times more power (I'd guess at least 4x) than the memory subsystem of IvyBridge-EX. I didn't look at the number of channels that are active in the low-end box that was benchmarked on SAP SD, but I would be surprised if its memory controllers plus memory I/O buffers do not consume 1.5 times more power than those of IVB-EX.
As to the LLC, IBM's 6-core has 48 MB of EDRAM vs Intel's 37.5 MB of SRAM. In theory, that should give IBM a significant advantage in dynamic power consumption. The cost they pay in static power should be negligible in a heavy-load scenario.
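To make that argument concrete, here is a toy model in Python; every parameter value is a placeholder I invented to illustrate the shape of the trade-off, not an IBM or Intel number:

# Toy LLC power model: total = leakage + dynamic. The only point is that under
# heavy load the dynamic term dominates, so a cache that cuts energy per access
# (EDRAM) can afford a somewhat worse static-power figure. All values below are
# invented placeholders.
def llc_power_w(leakage_w, energy_per_access_nj, accesses_per_sec):
    dynamic_w = energy_per_access_nj * 1e-9 * accesses_per_sec
    return leakage_w + dynamic_w

ACCESSES_PER_SEC = 2e9   # placeholder heavy-load access rate

sram_like  = llc_power_w(leakage_w=3.0, energy_per_access_nj=2.0,
                         accesses_per_sec=ACCESSES_PER_SEC)
edram_like = llc_power_w(leakage_w=4.0, energy_per_access_nj=1.0,
                         accesses_per_sec=ACCESSES_PER_SEC)

print("SRAM-like LLC:  %.1f W" % sram_like)    # 3 + 4 = 7 W
print("EDRAM-like LLC: %.1f W" % edram_like)   # 4 + 2 = 6 W

Under load the access term swamps the static-power difference; at idle the comparison would of course tilt the other way.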
>
> >
> > > >
> > > > > Not that this is attributable to decoder alone or x86 tax
> > > > > at all necessarily, but just to head off any claim of it being a furnace.
> > > > >
> > > > > I don't know what you mean by "tracking dependencies++", but there is
> > > > > no indication that POWER8 uses a uop cache, so you're simply wrong.
> > > > >
> > > >
> > > > Tracking dependencies within a group of instructions that
> > > > are renamed in parallel. Conventional wisdom says that
> > > > it has complexity of O(width^2). Maybe there was an algorithmic breakthrough in this area, I don't know...
> > >
> > > That has nothing to do with decoding stage, however.
> > >
> >
> > The context was practical limits of the width of the in-order front end of OoO cores.
>
> No, it was, very specifically, the decoding cost. My comment was the decoding
> cost was higher, and the response was something along the lines of "not really
> because all CPUs have to track dependencies anyway", which is just stupid.
>
> Decoding cost of x86 is higher than most other ISAs, particularly parallel decoding.
The point is, parallel decoding cost is *not* the likely reason behind Intel's decision to keep the renamer 4-wide.
Six uops/clock are quite often readily available from the uop cache (L0C$); however, as far as I remember, Intel does not try to use them for wider renaming. I take that as a hint (not proof) that their simulations show wider renaming has a negative effect on perf/watt.
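For reference, the dependency tracking I was talking about is the intra-group cross-check in the renamer: each instruction's sources have to be compared against the destinations of all older instructions renamed in the same cycle, which is where the O(width^2) comparator count comes from. A minimal sketch in Python (a hypothetical representation, not any real renamer):

# Sketch of intra-group dependency checking in a W-wide renamer.
# Each instruction is (dest_arch_reg, [src_arch_regs]). Every source must be
# checked against the destinations of all *older* instructions in the same
# rename group; on a match we forward the newly allocated physical register
# instead of the stale map-table entry. That nested loop is the O(W^2)
# comparator network; hardware does all comparisons in parallel in one cycle.
def rename_group(group, map_table, free_list):
    renamed = []
    new_dests = []                         # (arch_reg, phys_reg) of older instructions in this group
    for dest, srcs in group:
        phys_srcs = []
        for s in srcs:
            phys = map_table[s]
            for arch, p in new_dests:      # compare against every older destination
                if arch == s:
                    phys = p               # intra-group forwarding; youngest producer wins
            phys_srcs.append(phys)
        new_phys = free_list.pop(0)
        new_dests.append((dest, new_phys))
        renamed.append((new_phys, phys_srcs))
    for arch, p in new_dests:
        map_table[arch] = p                # commit the map-table updates for the whole group
    return renamed

# Example: r1 = r2 + r3; r4 = r1 + r1. The second instruction must pick up the
# new physical register for r1 from within the group, not from the old map table.
mt = {i: 100 + i for i in range(8)}
print(rename_group([(1, [2, 3]), (4, [1, 1])], mt, [200, 201, 202, 203]))
# -> [(200, [102, 103]), (201, [200, 200])]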