By: Eric (eric.kjellen.delete@this.gmail.com), August 7, 2012 4:58 am
Room: Moderated Discussions
jp (asdfasdf.delete@this.gmail.com) on August 7, 2012 3:08 am wrote:
> aaron spink (aaronspink.delete@this.notearthlink.net) on August 6, 2012 4:33 am wrote:
> > EBFE (x.delete@this.y.com) on August 6, 2012 3:09 am wrote:
> > > Coding would definitely be easier for KNC than GPU, if your target
> > > is e.g. 40% efficiency.
> > > However, suppose big-K is 2TFlops.
> > > Is it easier to code KNC for 90%, than GPU for 50%? (thus same performance)
> > > The good thing is that KNC seems to have more bandwidth, so it might be
> > > true.
> > >
> > It is unlikely that K20 is >1.5 TFlops. As of right now there
> > are no plans for the GK110 to be put in the consumer space, so they cannot rely
> > on binning good dies for the K product. They also have a thermal envelope to
> > fit into. So it is unlikely they'll be releasing at 1 Ghz+. What we do know
> > about K20 is that it is 15 SMX @ 64 DP per SMX for a best case of 1920 TFlops @
> > 1 Ghz. Thermals + large die will probably cost them 20-30% frequency putting
> > them in the range of 1.4 TFlops. And even that is probably a bit generous since
> > it is unlikely that they'll be able enable all 15 SMX due to defects.
> >
> > And yes, it is likely easier to code KNC for 90% than for GPU for 50% on average
> > across the relevant workloads. For one thing, all your code will compile on KNC
> > @ day 1 and KNC offer much more flexibility in the decomposing of a program.
> > KNC provides offload, symmetric, and host models. K20 only provides
> > offload.
> >
> > > 665GFlops/M2090 is theoretical.
> > > At launch time, Fermi Linpack is ~56%. So it's likely
> > > Assuming unchanged efficiency, big-K is also around TFlops Linpack.
> > >
> > I'll put my stake down at ~800-900 Gflops linpack performance for K20.
> > I think you are assuming too high of a peak for K20. Even my 1.4 Tflop
> > peak is >2x 2090.
>
> Just a note on the performance numbers mentioned before. From what I heard
> KNC does not have the FMA instruction.
>
> With the 1.09 Ghz that would land us a whopping 1.09 * 61 * (512/64 ) = 558
> GFLOPS DP, behind the current competition, in other words it looks like KNC
> would be 2 years late to the market.
>
> On the K20 we can expect conservative clocks at maybe ~ 0.8 Ghz which would
> put us at 15*64*0.8*2 => 1536 GFLOPs DP, way ahead of the competition...
> Given that GPUs are already hitting 80% for matrix operation applications we
> should be seeing at least 1.22 TFLOPs by a conservative estimate.
>
> More importantly, for real world applications we should be seeing
> 15 * 192 * 2 * 0.8 => 4608 SP GFLOPS. And yes, single precision is extremely
> important within for example image and signal processing applications.
>
> Cheers,
>
That's not correct: KNC/MIC has FMA support. See page 18 of the following PDF (page 14 also states that DP performance is 1 TFLOPS):
2012.04.25 Andrzej Nowak - An overview of Intel MIC - technology, hardware and software v3
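
To make the effect of FMA on the numbers above concrete, here is a rough sketch of the peak-throughput arithmetic. It only reuses the clocks, unit counts and lane counts quoted in this thread (1.09 GHz and 61 cores for KNC, 0.8 GHz and 15 SMX for K20), none of which are confirmed specifications, and the helper name `peak_gflops` is made up for this example. Note that 1.09 * 61 * (512/64) works out to about 532 rather than 558 (558 would correspond to 64 cores), and that counting FMA as two flops per lane per cycle roughly doubles it, which is in line with the ~1 TFLOPS DP figure in the slides.

```python
def peak_gflops(clock_ghz, units, lanes_per_unit, fma=True):
    """Theoretical peak in GFLOPS: clock * units * lanes * flops per lane per cycle.

    Hypothetical helper for illustration; an FMA counts as 2 flops
    per lane per cycle, a plain add or multiply as 1.
    """
    flops_per_lane = 2 if fma else 1
    return clock_ghz * units * lanes_per_unit * flops_per_lane


# KNC with the figures quoted above: 61 cores, 512-bit vectors of 64-bit doubles.
dp_lanes = 512 // 64
knc_no_fma = peak_gflops(1.09, 61, dp_lanes, fma=False)  # about 532 GFLOPS
knc_fma = peak_gflops(1.09, 61, dp_lanes, fma=True)      # about 1064 GFLOPS

# K20 with the figures quoted above: 15 SMX, 64 DP or 192 SP lanes each, 0.8 GHz.
k20_dp = peak_gflops(0.8, 15, 64)    # about 1536 GFLOPS
k20_sp = peak_gflops(0.8, 15, 192)   # about 4608 GFLOPS

for name, value in [("KNC DP, no FMA", knc_no_fma), ("KNC DP, FMA", knc_fma),
                    ("K20 DP", k20_dp), ("K20 SP", k20_sp)]:
    print(f"{name}: {value:.0f} GFLOPS")
```

These are theoretical peaks only; sustained Linpack numbers depend on the efficiency figures debated above.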



