By: anon (anon.delete@this.anon.com), August 11, 2014 6:29 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on August 11, 2014 5:23 am wrote:
> anon (anon.delete@this.anon.com) on August 11, 2014 3:56 am wrote:
> > Michael S (already5chosen.delete@this.yahoo.com) on August 10, 2014 11:30 pm wrote:
> > > anon (anon.delete@this.anon.com) on August 10, 2014 5:27 pm wrote:
> > > > Michael S (already5chosen.delete@this.yahoo.com) on August 10, 2014 3:11 am wrote:
> > > > > anon (anon.delete@this.anon.com) on August 9, 2014 12:29 am wrote:
> > > > > >
> > > > > > The big Intel cores use significant complexity to tackle the problem and they're stuck
> > > > > > at 4. POWER has reached 8 without problems (with almost certainly better throughput/watt
> > > > > > on its target workloads).
> > > > >
> > > > > "almost certainly" is way too strong a statement. It's possible, yes. But so far we have zero evidence.
> > > >
> > > > We have non-zero evidence. Not complete, but there is evidence.
> > > >
> > >
> > > If I am not mistaken, all we have now are very impressive 4x6-core Power8 SAP SD 2-tier scores that still
> > > lose in absolute numbers to 4x15-core Intel and approximately match die-for-die 16x16-core Fujitsu.
> > > We don't know which system between the three consumes less power under load, not even approximately.
> >
> > Well, we have some power specifications from IBM and Intel
> > too, although granted if looking at system power,
>
> Do we really have power specifications from IBM? Including 6-core parts?
Well you can extrapolate without getting too far off track.
>
> As to Intel, we have TDP, but we also know from experience that Intel CPUs tend to
> approach and sometimes exceed TDP when running high-IPC workloads, esp. those that
> are SIMD-heavy. On the other hand, when Intel CPUs run low-IPC integer-only code they
> tend to stay well below TDP even when CPU load monitor shows 100% all the time.
> One of the latest SPECpower_ssj2008 submissions is a good example:
> The whole 4xE5-4650 v2 server, with memory and disks (not heavily used in this particular
> benchmark, but still...) and network card and fans and not 100% efficient power supply at
> full load consumes 374 Watts, i.e. less than 380W of TDP of CPUs alone! And that's while
> running SPECpower_ssj2008, which has significantly higher IPC than SAP SD 2-tier.
>
> http://www.spec.org/power_ssj2008/results/res2014q2/power_ssj2008-20140401-00654.html
>
> > or even just taking into account the external memory controllers on POWER8, we don't know for sure.
>
> In general, Power8 has a higher-throughput memory subsystem that, when running at full gear, certainly consumes
> several times more power (I'd guess, at least 4x) than the memory subsystem of IvyBridge-EX. I didn't look at the
> # of channels that are activated in the low-end box that was benchmarked by SAP SD, however I would be surprised
> if its memory controllers + memory I/O buffers do not consume 1.5 times more than IVB-EX.
> As to LLC, IBM's 6-core has 48 MB of EDRAM vs Intel's 37.5 MB of SRAM. In theory,
> that should give IBM significant advantage in dynamic power consumption. The cost
> they pay in static power should be negligible in heavy load scenario.
But you're right, there are too many variables and my statement was too strong.
Put a better way: naysayers can no longer claim that POWER has poor perf/watt.
>
> >
> > >
> > > > >
> > > > > > Not that this is attributable to decoder alone or x86 tax
> > > > > > at all necessarily, but just to head off any claim of it being a furnace.
> > > > > >
> > > > > > I don't know what you mean by "tracking dependencies++", but there is
> > > > > > no indication that POWER8 uses a uop cache, so you're simply wrong.
> > > > > >
> > > > >
> > > > > Tracking dependencies within a group of instructions that
> > > > > are renamed in parallel. Conventional wisdom says that
> > > > > it has complexity of O(width^2). Maybe there was an algorithmic breakthrough in this area, I don't know...
> > > >
> > > > That has nothing to do with decoding stage, however.
> > > >
> > >
> > > The context was practical limits of the width of in-order front end of OoO cores.
> >
> > No, it was, very specifically, the decoding cost. My comment was the decoding
> > cost was higher, and the response was something along the lines of "not really
> > because all CPUs have to track dependencies anyway", which is just stupid.
> >
> > Decoding cost of x86 is higher than most other ISAs, particularly parallel decoding.
>
> The point is, parallel decoding cost is *not* the likely
> reason behind Intel's decision to keep renamer 4-wide.
> 6 uOps/clock are quite often readily available from L0C$, however, as far as I remember,
> Intel does not try to utilize them for wider renaming. I take it as a hint (not proof)
> that their simulations show that wider renaming has negative effect on perf/watt.
No, but we're not talking about the renamer. We're talking about the decoder. And the cost of decoding *was* the likely reason behind the uop cache. That's not to say ARM or POWER implementations could never see any benefit from a uop cache, but Intel's fondness for it on its wide, high-performance microarchitectures does give some hint of a higher decoding cost for x86.
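
For reference, the O(width^2) figure quoted above for the renamer can be illustrated with a rough count. This is only a sketch under simplifying assumptions that are not from the thread: each instruction has two register sources and one destination, and every source is compared against the destination of every older instruction in the same rename group. The function names are made up for the illustration, and real renamers differ in detail.

```python
def comparators_closed_form(width: int, srcs_per_insn: int = 2) -> int:
    """Intra-group dependency comparators for a rename group of `width`.

    Assumption: every source of an instruction is compared against the
    destination of every older instruction in the same group.
    """
    return srcs_per_insn * width * (width - 1) // 2


def comparators_by_enumeration(width: int, srcs_per_insn: int = 2) -> int:
    """Same count, obtained by walking every (older, younger) pair."""
    count = 0
    for younger in range(width):
        for _older in range(younger):
            count += srcs_per_insn  # one comparator per source operand
    return count


if __name__ == "__main__":
    for w in (4, 6, 8):
        print(w, comparators_closed_form(w))
```

Under these assumptions a 4-wide group needs 12 comparators, a 6-wide group 30, and an 8-wide group 56. That growth sits in the rename stage and says nothing about the cost of decoding, which is a separate question.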