By: Emil Briggs (me.delete@this.nowherespam.com), July 28, 2012 5:11 pm
Room: Moderated Discussions
aaron spink (aaronspink.delete@this.notearthlink.net) on July 28, 2012 7:05 am wrote:
> Emil Briggs (me.delete@this.nowherespam.com) on July 28, 2012 6:40 am wrote:
> > By ORNL do you mean Oak Ridge National Laboratory? I ask since I am a large
> > user at ORNL. Currently only some of the Jaguar/Titan nodes are equipped with
> > GPU's but there are enough of them installed to do realistic evaluations. For
> > certain workloads (and when properly programmed) they beat CPU's pretty
> > handily performance wise. And that data comes from a real world application
> > not LINPACK. The section of the code that I adapted for GPU's runs 3 to 4
> > times faster than it does on CPU's. It's also possible with this particular
> > application to overlap some operations on the CPU and GPU and hide the latency
> > of PCI-E data transfers. Obviously not all applications can benefit in the
> > same way and it's not easy to do so even when possible but GPU's can offer
> > some very nice performance gains in some cases.
> >
>
> I'm not denying that there are some workloads and some kernels that have an
> advantage, I am however saying that it is generally the exception based on data
> published from both PRACE and ORNL, et al.
>
> BTW, what % of peak are you seeing on the GPUs with your code?
>
There are two places where we are using GPUs. One of them consists of large matrix operations, which are done using the Nvidia cuBLAS library. Those run close to 80% of peak. The tricky part of the work here is keeping the CPUs busy doing something useful while moving the matrices back and forth to the GPU. The other place is some finite difference routines. I'm still working on this. It's faster than doing it all on the CPUs but not by much. I'm trying to get more overlap between the CPU and GPUs here but this section of the code is not as suitable for that as the first.
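As a rough illustration of how a "% of peak" figure for a matrix multiply is usually derived, here is a minimal Python sketch. It assumes the standard operation count of 2·m·n·k for a dense matrix product. The function names, matrix sizes, timing, and peak rate are all hypothetical and are not measurements from this post.

```python
def dgemm_flops(m, n, k):
    """Floating-point operations for C = A * B with A (m x k) and B (k x n).

    Each of the m*n output elements needs k multiplies and k adds.
    """
    return 2 * m * n * k


def percent_of_peak(m, n, k, seconds, peak_gflops):
    """Achieved rate as a percentage of the device's theoretical peak."""
    achieved_gflops = dgemm_flops(m, n, k) / seconds / 1e9
    return 100.0 * achieved_gflops / peak_gflops


if __name__ == "__main__":
    # Hypothetical inputs chosen only to illustrate the arithmetic.
    pct = percent_of_peak(4096, 4096, 4096, seconds=0.26, peak_gflops=665.0)
    print(f"{pct:.1f}% of peak")
```

Whether the timing includes the PCI-E transfers or only the kernel changes the result considerably, so it is worth stating which one a quoted figure refers to.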
> > That being said I do think that the cost of moving data across the PCI-E bus
> > and the difficulty of the programming model are some real downsides to GPU's.
> > How all that plays out will be interesting and I'm looking forward to getting
> > my hands on some Intel MIC hardware to see what we can do with it.
> >
>
> eventually see x32/x40 PCI-E interfaces or dual QPI interfaces. Certainly would
> be nice if the GPUs/MICs had simple coherent access to memory.
Agreed. How difficult do you think it would be to implement coherent memory access?
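
To put the PCI-E transfer cost and the benefit of overlap in rough numbers, here is a minimal Python sketch of a two-stage pipeline model. It assumes the work is split into equal chunks, that the transfer of one chunk can fully overlap the compute of the previous one, and that there is no other contention. The function names and the timings are hypothetical, not taken from the application discussed here.

```python
def serial_time(chunks, transfer_s, compute_s):
    """Total time when every chunk is transferred, then computed, with no overlap."""
    return chunks * (transfer_s + compute_s)


def overlapped_time(chunks, transfer_s, compute_s):
    """Total time for an idealized two-stage pipeline (double buffering).

    The first transfer and the last compute cannot be hidden; in between,
    each step costs whichever stage is slower.
    """
    if chunks <= 0:
        return 0.0
    return transfer_s + compute_s + (chunks - 1) * max(transfer_s, compute_s)


if __name__ == "__main__":
    # Hypothetical per-chunk timings in seconds.
    n, t, c = 8, 0.010, 0.030
    print("serial    :", serial_time(n, t, c))
    print("overlapped:", overlapped_time(n, t, c))
```

In this model the transfers are hidden almost entirely when compute per chunk exceeds transfer per chunk. When the two are close, or transfer dominates, overlap helps much less, which would be consistent with one section of a code benefiting more than another.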