By: EBFE (x.delete@this.y.com), August 6, 2012 3:09 am
Room: Moderated Discussions
Eric (eric.kjellen.delete@this.gmail.com) on August 5, 2012 10:37 am wrote:
> EBFE (x.delete@this.y.com) on August 4, 2012 6:37 am wrote:
> > Eric (eric.kjellen.delete@this.gmail.com) on August 3, 2012 7:59 am wrote:
> > > EBFE (x.delete@this.y.com) on August 3, 2012 1:57 am wrote:
> > > > David Kanter (dkanter.delete@this.realworldtech.com) on July 26, 2012 9:31 am wrote:
> > > > > > > > What is the currently best ratio of GPU to CPU?
> > > > > > >
> > > > > > > That depends on your workload.
> > > > > > >
> > > > > > > > Will the best future design be a BlueGene/Q (for best I/O) driving an
> > > > > > > > optimum number of GPUs (for best Compute)?
> > > > > > >
> > > > > > > If you look at the chart, it should be clear that BGQ stands alone and
> > > > > > > doesn't need GPUs. From a power perspective, it's superior to all
> > > > > > > existing GPUs - and it can run a far wider range of workloads.
> > > > > > >
> > > > > > > David
> > > > > >
> > > > > > BGQ is about the same as 3GiB HD7970(947G/250W). So BGQ is quite likely
> > > > > > to be inferior to a 6GiB incoming-gen Firestream.
> > > > > > [If I recall correctly, at launch time AMD listed 7970 GPU/board power
> > > > > > as 210/250]
> > > > >
> > > > > I expect that the next generation of throughput processors (Knights
> > > > > Corner, Tahiti, Kepler) will significantly alter the landscape because
> > > > > they represent a jump in process.
> > > > >
> > > > > For AMD and Nvidia it is 40nm to 28nm, and for Intel it is the move to
> > > > > 22nm.
> > > > >
> > > > > DK
> > > >
> > > >
> >
> >
> >
> http://vr-zone.com/articles/intel-xeon-phi-b0-stepping--the-knight-in-shin
> >
> >
> > >
> > > ing-armor-/16871.html
> > > > (if not fake)
> Looks pretty bad: low
> > freq, small ram,
> > > high
> > > >
> tdp
> > > > The number is so bad that
> > I tend to think it's
> fake.
> > >
> > > Why do you think that? Assuming that the top clock frequencies can actually
> > > be sustained, it's 8 DP FLOP (512-bit vector unit) * 2 (FMA?) * 60 * 1.09 GHz
> > > = 1.046 TFLOPS. That's in line with the 1 DP TFLOPS that Knights Corner was
> > > reported to clock in at in DGEMM/LINPACK. TDP and RAM look OK.
> >
> > Linpack TFlops/300W is likely to defeat next Firestream on perf and perf/w,
> > or even next Fermi by perf.
> >
> > However, I am assuming the following disadvantage, so I was expecting more
> > flops and flops/W.
> > 1. Coding for a target performance level is harder for KNC than GPU.
>
> I strongly suspect that the exact opposite is true.
>
Coding would definitely be easier for KNC than for a GPU if your target is, e.g., 40% efficiency.
However, suppose big-K (big Kepler) peaks at 2 TFlops.
Is it easier to code KNC to 90% efficiency than a GPU to 50% (and thus the same performance)?
The good thing is that KNC seems to have more bandwidth, so it might be true.
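
To put rough numbers on that question, here is a small Python sketch. It uses the peak formula quoted above (8 DP lanes * 2 for FMA * 60 cores * 1.09 GHz) and my hypothetical 2 TFlops peak for big-K. The function names are made up for illustration, and nothing here is a measurement.

```python
def peak_dp_gflops(cores, ghz, dp_lanes=8, flops_per_lane=2):
    """Theoretical peak in GFlops: lanes * FLOPs per lane per cycle * cores * GHz.

    flops_per_lane=2 assumes one fused multiply-add per lane per cycle.
    """
    return dp_lanes * flops_per_lane * cores * ghz


def required_efficiency(target_gflops, peak_gflops):
    """Fraction of peak a chip must sustain to deliver target_gflops."""
    return target_gflops / peak_gflops


# From the leaked B0-stepping specs (if they are real): 60 cores at 1.09 GHz.
knc_peak = peak_dp_gflops(cores=60, ghz=1.09)

# ASSUMPTION: big-K peaks at 2 TFlops and is coded to 50% efficiency.
gpu_peak = 2000.0
gpu_sustained = 0.50 * gpu_peak

print(f"KNC peak:          {knc_peak:.1f} GFlops")
print(f"KNC at 90%:        {0.90 * knc_peak:.1f} GFlops")
print(f"GPU at 50%:        {gpu_sustained:.1f} GFlops")
print(f"KNC needs to hit:  {required_efficiency(gpu_sustained, knc_peak):.1%} of peak")
```

Under these assumptions KNC at 90% lands a bit under 1 TFlops, so strictly matching the GPU would take closer to 96% of peak. The two cases are only roughly "the same performance".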
> > KNC has to close the gap of low raw flops by significantly higher efficiency,
> > which could be impossible or make coding harder.
>
> Raw DP FLOPS are much higher
> than the current competition (Tesla M2090 rates at 665 GFLOPS, presumably in
> LINPACK, and FireStream 9370 at 528 DP GFLOPS). Tesla K20 (big Kepler) at 28nm
> will definitely beat KNC though, as far as I know performance targets are at
> 1.5-2 DP TFLOPS.
>
The 665 GFlops figure for the M2090 is theoretical peak.
At launch time, Fermi's Linpack efficiency was ~56%, so the M2090 is likely <=370 GFlops in Linpack.
Assuming unchanged efficiency, big-K would also land around 1 TFlops in Linpack.
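
The same estimate as a few lines of Python, in case anyone wants to plug in other numbers. The 56% figure is the approximate launch-time Fermi efficiency mentioned above. Applying it unchanged to big-K, and using the quoted 1.5-2 TFlops target range as peak, are both assumptions.

```python
# Approximate launch-time Fermi Linpack efficiency (fraction of theoretical peak).
FERMI_LINPACK_EFFICIENCY = 0.56


def linpack_estimate(peak_gflops, efficiency=FERMI_LINPACK_EFFICIENCY):
    """Estimated sustained Linpack GFlops for a given theoretical peak."""
    return peak_gflops * efficiency


m2090 = linpack_estimate(665.0)        # theoretical peak quoted for Tesla M2090
big_k_low = linpack_estimate(1500.0)   # ASSUMPTION: low end of the rumored target
big_k_high = linpack_estimate(2000.0)  # ASSUMPTION: high end of the rumored target

print(f"M2090 Linpack estimate: {m2090:.0f} GFlops")
print(f"big-K Linpack estimate: {big_k_low:.0f}-{big_k_high:.0f} GFlops")
```

With these inputs the big-K range straddles 1 TFlops, which is the same ballpark as the KNC peak figure discussed earlier.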