By: jp (asdfasdf.delete@this.gmail.com), August 7, 2012 5:17 am
Room: Moderated Discussions
Eric (eric.kjellen.delete@this.gmail.com) on August 7, 2012 4:58 am wrote:
> jp (asdfasdf.delete@this.gmail.com) on August 7, 2012 3:08 am wrote:
> > aaron spink (aaronspink.delete@this.notearthlink.net) on August 6, 2012 4:33 am wrote:
> > > EBFE (x.delete@this.y.com) on August 6, 2012 3:09 am wrote:
> > > > Coding would definitely be easier for KNC than GPU, if your target is e.g. 40% efficiency.
> > > > However, suppose big-K is 2 TFlops.
> > > > Is it easier to code KNC for 90%, than GPU for 50%? (thus same performance)
> > > > The good thing is that KNC seems to have more bandwidth, so it might be true.
> > > >
> > > It is unlikely that K20 is >1.5 TFlops. As of right now there are no plans for the GK110
> > > to be put in the consumer space, so they cannot rely on binning good dies for the K product.
> > > They also have a thermal envelope to fit into. So it is unlikely they'll be releasing at
> > > 1 GHz+. What we do know about K20 is that it is 15 SMX @ 64 DP per SMX for a best case of
> > > 1920 GFlops @ 1 GHz. Thermals + large die will probably cost them 20-30% frequency, putting
> > > them in the range of 1.4 TFlops. And even that is probably a bit generous since it is
> > > unlikely that they'll be able to enable all 15 SMX due to defects.
> > >
> > > And yes, it is likely easier to code KNC for 90% than GPU for 50% on average across the
> > > relevant workloads. For one thing, all your code will compile on KNC @ day 1, and KNC offers
> > > much more flexibility in decomposing a program. KNC provides offload, symmetric, and host
> > > models. K20 only provides offload.
> > >
> > > > 665GFlops/M2090 is theoretical. At launch time, Fermi Linpack is ~56%.
> > > > So it's likely, assuming unchanged efficiency, that big-K is also around TFlops Linpack.
> > > >
> > > I'll put my stake down at ~800-900 Gflops Linpack performance for K20. I think you are
> > > assuming too high of a peak for K20. Even my 1.4 Tflop peak is >2x 2090.
> >
> > Just a note on the performance numbers mentioned before. From what I heard KNC does not have
> > the FMA instruction.
> >
> > With the 1.09 GHz that would land us a whopping 1.09 * 61 * (512/64) ≈ 532 GFLOPS DP,
> > behind the current competition. In other words, it looks like KNC would be 2 years late to
> > the market.
> >
> > On the K20 we can expect conservative clocks at maybe ~0.8 GHz, which would put us at
> > 15*64*0.8*2 => 1536 GFLOPs DP, way ahead of the competition... Given that GPUs are already
> > hitting 80% for matrix operation applications we should be seeing at least 1.22 TFLOPs by a
> > conservative estimate.
> >
> > More importantly, for real world applications we should be seeing 15 * 192 * 2 * 0.8 =>
> > 4608 SP GFLOPS. And yes, single precision is extremely important in, for example, image and
> > signal processing applications.
> >
> > Cheers,
>
> That's not correct, KNC/MIC has FMA support. Look at page 18 in the following PDF (at page 14
> it also says that DP performance is 1 TFLOPS):
>
> 2012.04.25 Andrzej Nowak - An overview of Intel MIC - technology, hardware and software v3
Thanks for the source! Have been searching for this.
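
For anyone who wants to redo the arithmetic from the thread, here is a minimal sketch of the peak estimates. The function names are made up for illustration, and the inputs are only the figures quoted above (61 cores with 512-bit vectors at 1.09 GHz for KNC; 15 SMX with 64 DP or 192 SP units each for K20). It assumes one FMA counts as two floating-point operations and ignores everything that separates peak from real performance.

```python
def peak_gflops(units, lanes_per_unit, clock_ghz, fma=True):
    """Theoretical peak in GFLOPS: units * lanes * flops per lane per cycle * GHz."""
    flops_per_lane = 2 if fma else 1  # assumption: an FMA counts as two operations
    return units * lanes_per_unit * flops_per_lane * clock_ghz


def sustained_gflops(peak, efficiency):
    """Scale a theoretical peak by an assumed efficiency between 0 and 1."""
    return peak * efficiency


if __name__ == "__main__":
    knc_dp_lanes = 512 // 64  # a 512-bit vector holds 8 doubles

    estimates = {
        "KNC DP, 1.09 GHz, no FMA": peak_gflops(61, knc_dp_lanes, 1.09, fma=False),
        "KNC DP, 1.09 GHz, FMA": peak_gflops(61, knc_dp_lanes, 1.09),
        "K20 DP, 1.0 GHz": peak_gflops(15, 64, 1.0),
        "K20 DP, 0.8 GHz": peak_gflops(15, 64, 0.8),
        "K20 SP, 0.8 GHz": peak_gflops(15, 192, 0.8),
    }
    for name, value in estimates.items():
        print(f"{name}: {value:.0f} GFLOPS peak")

    # Efficiency guesses taken from the thread: ~56% (Fermi Linpack) and 80%.
    k20_dp = estimates["K20 DP, 0.8 GHz"]
    for eff in (0.56, 0.80):
        print(f"K20 DP at {eff:.0%}: {sustained_gflops(k20_dp, eff):.0f} GFLOPS")
```

With FMA counted, the 61-core KNC product comes to roughly 1064 GFLOPS DP, which lines up with the ~1 TFLOPS on the slides; without FMA it is about 532 GFLOPS.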



