By: anon (anon.delete@this.anon.com), July 29, 2012 2:53 am
Room: Moderated Discussions
Emil Briggs (me.delete@this.nowherespam.com) on July 28, 2012 6:11 pm wrote:
> aaron spink (aaronspink.delete@this.notearthlink.net) on July 28, 2012 7:05 am wrote:
> > Emil Briggs (me.delete@this.nowherespam.com) on July 28, 2012 6:40 am wrote:
> >
> > > By ORNL do you mean Oak Ridge National Laboratory? I ask since I am a large
> > > user at ORNL. Currently only some of the Jaguar/Titan nodes are equipped with
> > > GPU's but there are enough of them installed to do realistic evaluations. For
> > > certain workloads (and when properly programmed) they beat CPU's pretty
> > > handily performance wise. And that data comes from a real world application
> > > not LINPACK. The section of the code that I adapted for GPU's runs 3 to 4
> > > times faster than it does on CPU's. It's also possible with this particular
> > > application to overlap some operations on the CPU and GPU and hide the
> > > latency of PCI-E data transfers. Obviously not all applications can benefit
> > > in the same way and it's not easy to do so even when possible but GPU's can
> > > offer some very nice performance gains in some cases.
> > >
> >
> > I'm not denying that there are some workloads and some kernels that have an
> > advantage, I am however saying that it is generally the exception based on
> > data published from both PRACE and ORNL, et al.
> >
> > BTW, what % of peak are you seeing on the GPUs with your code?
> >
>
> There are two places where we are using GPU's. One of them consists of large
> matrix operations, which are done using the Nvidia cuBLAS library. Those run
> close to 80% of peak. The tricky part of the work here is keeping the CPU's
> busy doing something useful while moving the matrices back and forth to the
> GPU. The other place is some finite difference routines. Still working on
> this. It's faster than doing it all on the CPU's but not by much. I'm trying
> to get more overlap between the CPU and GPU's here but this section of the
> code is not as suitable for that as the first.
>
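To put rough numbers on the overlap point quoted above, here is a back-of-the-envelope model rather than a measurement. It assumes the work splits into equal chunks, that each chunk needs one PCI-E transfer followed by one compute step, and that a transfer can run concurrently with compute on the previous chunk. The function names and the timings are invented for illustration and are not taken from Emil's code.

```python
def serial_time(n_chunks, transfer_s, compute_s):
    """Total time when each transfer must finish before its compute starts
    and nothing overlaps."""
    return n_chunks * (transfer_s + compute_s)


def overlapped_time(n_chunks, transfer_s, compute_s):
    """Total time for a two-stage pipeline: while chunk i is being computed,
    chunk i+1 is being transferred. Only the first transfer and the last
    compute are fully exposed; in between, the slower stage sets the pace."""
    if n_chunks == 0:
        return 0.0
    return transfer_s + (n_chunks - 1) * max(transfer_s, compute_s) + compute_s


if __name__ == "__main__":
    # Assumed numbers: 8 chunks, 1.0 s per transfer, 3.0 s per compute step.
    print(serial_time(8, 1.0, 3.0), overlapped_time(8, 1.0, 3.0))
```

In this model the gain depends on the ratio of transfer to compute time. When compute dominates, nearly all of the transfer cost disappears; when the two are comparable, as may be the case for the finite difference routines, there is much less to hide.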
> >
> >
> > > That being said I do think that the cost of moving data across the PCI-E
> > > bus and the difficulty of the programming model are some real downsides to
> > > GPU's. How all that plays out will be interesting and I'm looking forward
> > > to getting my hands on some Intel MIC hardware to see what we can do with
> > > it.
> >
> >
> >
>
> > eventually see x32/x40 PCI-E interfaces or dual QPI interfaces. Certainly
> > would be nice if the GPUs/MICs had simple coherent access to memory.
>
> Agreed. How difficult do you think it would be to implement
> coherent memory access?
Probably not all that much harder than anything else they have to do. But I would bet that neither Intel nor AMD would allow NVIDIA to implement it, so if you see it at all, it will be in AMD or Intel GPUs.



