By: jp (asdfasdf.delete@this.gmail.com), August 7, 2012 5:23 am
Room: Moderated Discussions
jp (asdfasdf.delete@this.gmail.com) on August 7, 2012 5:17 am wrote:
> Eric (eric.kjellen.delete@this.gmail.com) on August 7, 2012 4:58 am wrote:
> > jp (asdfasdf.delete@this.gmail.com) on August 7, 2012 3:08 am wrote:
> > > aaron spink (aaronspink.delete@this.notearthlink.net) on August 6, 2012 4:33 am wrote:
> > > > EBFE (x.delete@this.y.com) on August 6, 2012 3:09 am wrote:
> > > > > Coding would definitely be easier for KNC than GPU, if your target is e.g. 40% efficiency.
> > > > > However, suppose big-K is 2TFlops.
> > > > > Is it easier to code KNC for 90%, than GPU for 50%? (thus same performance)
> > > > > The good thing is that KNC seems to have more bandwidth, so it might be true.
> > > > >
> > > > It is unlikely that K20 is >1.5 TFlops. As of right now there are no plans for the
> > > > GK110 to be put in the consumer space, so they cannot rely on binning good dies for
> > > > the K product. They also have a thermal envelope to fit into. So it is unlikely
> > > > they'll be releasing at 1 Ghz+. What we do know about K20 is that it is 15 SMX @ 64 DP
> > > > per SMX for a best case of 1920 GFlops @ 1 Ghz. Thermals + large die will probably
> > > > cost them 20-30% frequency putting them in the range of 1.4 TFlops. And even that is
> > > > probably a bit generous since it is unlikely that they'll be able to enable all 15 SMX
> > > > due to defects.
> > > >
> > > > And yes, it is likely easier to code KNC for 90% than for GPU for 50% on average
> > > > across the relevant workloads. For one thing, all your code will compile on KNC
> > > > @ day 1 and KNC offers much more flexibility in the decomposing of a program.
> > > > KNC provides offload, symmetric, and host models. K20 only provides offload.
> > > >
> > > > >
> 665GFlops/M2090 is theoretical.
> > >
> > >
> > > > At launch
> time, Fermi
> > > > Linpack is ~56%. So
> > >
> > > >
> it's likely
> > > Assuming unchanged efficiency, big-K is
> > >
>
> > > also
> > > > > around TFlops
> > > Linpack.
> >
> > > > I'll put my stake down at ~800-900 Gflops linpack performance for K20. I think you
> > > > are assuming too high of a peak for K20. Even my 1.4 Tflop peak is >2x 2090.
> > >
> > > Just a note on the performance numbers mentioned before. From what I heard KNC does
> > > not have the FMA instruction. With the 1.09 Ghz that would land us a whopping
> > > 1.09 * 61 * (512/64 ) = 558 GFLOPS DP, behind the current competition, in other words
> > > it looks like KNC would be 2 years late to the market.
> > >
> > > On the K20 we can expect conservative clocks at maybe ~ 0.8 Ghz which would put us at
> > > 15*64*0.8*2 => 1536 GFLOPs DP, way ahead of the competition... Given that GPUs are
> > > already hitting 80% for matrix operation applications we should be seeing at least
> > > 1.22 TFLOPs by a conservative estimate.
> > >
> > > More importantly, for real world applications we should be seeing
> > > 15 * 192 * 2 * 0.8 => 4608 SP GFLOPS. And yes, single precision is extremely
> > > important within for example image and signal processing applications.
> > >
> > > Cheers,
> >
> > That's not correct, KNC/MIC has FMA support. Look at page 18 in the following PDF
> > (at page 14 it also says that DP performance is 1 TFLOPS):
> >
> > 2012.04.25 Andrzej Nowak - An overview of Intel MIC - technology, hardware and software v3
>
> Thanks for the source! Have been searching for this.
>
Are they still maintaining cache coherency in KNC, or was that only in KNF? I'm guessing it would be a huge compatibility issue for x86 code if they did not.
It is also what's going to stop this hardware from scaling into the future ;)
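
For anyone who wants to re-check the peak numbers thrown around in the quotes, here is a rough Python sketch of the arithmetic. The helper name `peak_gflops` and its parameter names are made up for illustration, and every input is a guess taken from the thread (15 SMX, 64 DP or 192 SP lanes per SMX, 61 cores, 1.09 GHz), not a confirmed spec. Counting an FMA as 2 FLOPs per lane per cycle is the usual convention and is assumed here. Note that the no-FMA KNC expression as written in the quote comes out at roughly 532 GFLOPS rather than 558, and with FMA it lands close to the 1 TFLOPS figure from the slides.

```python
def peak_gflops(units, flops_per_unit_per_cycle, clock_ghz):
    """Theoretical peak in GFLOPS: units * FLOPs per unit per cycle * clock in GHz."""
    return units * flops_per_unit_per_cycle * clock_ghz


# K20 guesses from the thread: 15 SMX, 64 DP lanes each, FMA counted as 2 FLOPs.
k20_dp_1ghz = peak_gflops(15, 64 * 2, 1.0)    # best case at 1 GHz
k20_dp_08ghz = peak_gflops(15, 64 * 2, 0.8)   # "conservative" 0.8 GHz clock
k20_sp_08ghz = peak_gflops(15, 192 * 2, 0.8)  # 192 SP lanes per SMX

# KNC guesses from the thread: 61 cores, 512-bit vectors of 64-bit doubles, 1.09 GHz.
dp_lanes = 512 // 64
knc_dp_no_fma = peak_gflops(61, dp_lanes, 1.09)    # 1 FLOP per lane per cycle
knc_dp_fma = peak_gflops(61, dp_lanes * 2, 1.09)   # FMA counted as 2 FLOPs

# The "80% for matrix operations" estimate applied to the 0.8 GHz K20 guess.
k20_matrix_estimate = 0.8 * k20_dp_08ghz

if __name__ == "__main__":
    for name, value in [
        ("K20 DP @ 1.0 GHz", k20_dp_1ghz),
        ("K20 DP @ 0.8 GHz", k20_dp_08ghz),
        ("K20 SP @ 0.8 GHz", k20_sp_08ghz),
        ("KNC DP, no FMA", knc_dp_no_fma),
        ("KNC DP, with FMA", knc_dp_fma),
        ("K20 DP at 80% efficiency", k20_matrix_estimate),
    ]:
        print(f"{name}: {value:.0f} GFLOPS")
```

These are theoretical peaks only; they say nothing about what Linpack or a real application would sustain on either part.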



