By: anon (anon.delete@this.anon.com), July 29, 2012 11:09 am
Room: Moderated Discussions
aaron spink (aaronspink.delete@this.notearthlink.net) on July 29, 2012 5:39 am wrote:
> Emil Briggs (me.delete@this.nowherespam.com) on July 28, 2012 6:11 pm wrote:
> >
> > There are two places where we are using GPUs. One of them consists of large
> > matrix operations and is done using the Nvidia cublas library. Those run
> > close to 80% of peak. The tricky part of the work here is keeping the CPUs
> > busy doing something useful while moving the matrices back and forth to the
> > GPU. The other place is some finite difference routines. Still working on
> > this. It's faster than doing it all on the CPUs, but not by much. I'm trying
> > to get more overlap between the CPU and GPUs here, but this section of the
> > code is not as suitable for that as the first.
> >
>
> 80% is very very good. Especially compared to Linpack.
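
The overlap Emil describes maps onto the standard CUDA streams pattern: pinned host buffers plus cudaMemcpyAsync let the copy engine stage the next tile while cublas chews on the current one. A minimal sketch, with hypothetical names, tiling, and DGEMM shape throughout (not his actual code):

#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Double-buffer the input tiles so the copy engine stages tile t+1 while
 * cublas multiplies tile t. h_A[] must be pinned (cudaMallocHost) for the
 * async copies to actually overlap. Error checks omitted for brevity. */
void tiled_gemm(cublasHandle_t handle, int num_tiles, int n, size_t bytes,
                double **h_A, double *d_A[2], double *d_B, double *d_C)
{
    cudaStream_t copy_s, exec_s;
    cudaEvent_t done[2];
    cudaStreamCreate(&copy_s);
    cudaStreamCreate(&exec_s);
    cudaEventCreate(&done[0]);
    cudaEventCreate(&done[1]);
    cublasSetStream(handle, exec_s);

    const double one = 1.0;
    cudaMemcpy(d_A[0], h_A[0], bytes, cudaMemcpyHostToDevice);
    for (int t = 0; t < num_tiles; t++) {
        int cur = t & 1, nxt = cur ^ 1;
        if (t + 1 < num_tiles) {
            /* Don't overwrite d_A[nxt] until the DGEMM that read it is done,
               then stage the next tile on the copy stream. */
            cudaStreamWaitEvent(copy_s, done[nxt], 0);
            cudaMemcpyAsync(d_A[nxt], h_A[t + 1], bytes,
                            cudaMemcpyHostToDevice, copy_s);
        }
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &one, d_A[cur], n, d_B, n, &one, d_C, n);
        cudaEventRecord(done[cur], exec_s);
        cudaStreamSynchronize(copy_s);  /* next tile resident before reuse */
    }
    cudaStreamSynchronize(exec_s);
}

Parts with two copy engines can stream results back out on a third stream the same way.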
>
> > Agreed. How difficult do you think it would be to implement
> > coherent memory access?
>
> Depends. If you can convince the PCI SIG to implement a coherent protocol, the
> difficulty shouldn't be that high. Efficiency wouldn't be the greatest, since
> the basic protocols for PCI-E are designed around large block transfers, but
> it would be doable, especially with the move to integrating PCI-E on die.
>
> Probably the easiest solution for something like MIC would be to integrate a
> QPI interface in addition to the PCI-E interface. At that point it primarily
> becomes an exercise in setting up the memory map reasonably/sanely. The MIC
> would need both a caching agent and a home coherency agent, but it should be
> possible to do some cut and paste.
>
> What would be more difficult in the QPI + PCI-E space is getting maximum
> advantage out of it. You would ideally like to use both the QPI agent and the
> PCI-E agent for bulk DMA traffic while using only the QPI agent for coherent
> traffic. Using the QPI link for bulk DMA would likely take some work with the
> various DMA engines. For the MIC's local coherent memory, you would likely
> make only a subset of it available for coherent access from the CPU in order
> to simplify the performance requirements (likely a programmable, variable
> window size), so that memory accesses from the MIC to its local memory don't
> all have to remote snoop. Though if you have the area available, you might be
> able to get away with an SRAM-based directory (basically limited-capacity
> coherency from the MIC to the CPU, i.e. evict to make space) as well.
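
To make the window idea concrete: the filter the MIC would apply on every local access is just a range check against a pair of programmable registers. A hypothetical sketch (the register names are invented for illustration):

#include <stdint.h>
#include <stdbool.h>

/* The MIC's memory controller range-checks every local physical address
 * against programmable base/size registers; only hits inside the
 * CPU-visible window pay the remote-snoop cost, everything else stays
 * device-local. */
static uint64_t coh_window_base;  /* programmed by the host driver */
static uint64_t coh_window_size;  /* the "variable window size" */

static bool needs_remote_snoop(uint64_t paddr)
{
    /* unsigned wraparound makes this a single compare-and-branch */
    return paddr - coh_window_base < coh_window_size;
}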
>
> For the CPU memory, you would likely be unrestricted, assuming that the CPU
> side used some form of directory.
>
> Total DMA bandwidth should be at least equal to 32x PCI-E, and for coherent
> access you are looking at a minimum of 16x PCI-E bandwidth.
>
> And from a practical standpoint you are going to want 2x QPI or QPI + PCI-E,
> since it is unlikely that the market requirement will be there for the CPUs
> to have 3x 16x PCI-E. Though if your network interface chip runs over a
> single QPI link it might be viable, I kind of see the ideal setup for a
> top-end super as 1 QPI + 16x PCI-E to both the MIC/GPU and to the network
> interface. So you would be looking at ~32+ GB/s (at current speeds; likely
> 64+ GB/s in the 2015 timeframe based on PCI-E 4.0's announced goals) in and
> out of the CPU to both network and MIC, for a total of 128 GB/s, which means
> that memory bandwidth likely becomes your main bottleneck.
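
For anyone checking the arithmetic (PCI-E 3.0 moves ~1 GB/s per lane per direction at 8 GT/s with 128b/130b encoding, and the announced 4.0 goal is to double that), it works out like this:

#include <stdio.h>

/* Back-of-envelope check on the I/O totals above. */
int main(void)
{
    double gen3_lane = 1.0;                 /* GB/s per lane per direction */
    double x16_gen3  = 16 * gen3_lane * 2;  /* in + out: ~32 GB/s */
    double x16_gen4  = x16_gen3 * 2;        /* ~64 GB/s in the 2015 frame */
    double cpu_total = 2 * x16_gen4;        /* one link to MIC, one to NIC */
    printf("16x gen3 %.0f, 16x gen4 %.0f, CPU total %.0f GB/s\n",
           x16_gen3, x16_gen4, cpu_total);  /* 32, 64, 128 */
    return 0;
}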
>
> Also, let's not forget that by the time this happens we are likely going to
> see some form of stacked memory in reasonably wide use, which means that the
> CPUs will likely have 1-4 GB of ultra-high-bandwidth "cache".
It does not have to be a cache of main memory; the stacked DRAM could be (part of) main memory itself. Why not?
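E.g. exposed as its own (fast) NUMA node, placement becomes an explicit software call instead of a hardware cache policy. A purely hypothetical sketch with libnuma, assuming the stack shows up as node 1:

#include <stddef.h>
#include <numa.h>

/* Illustrative only: if the on-package stacked DRAM is a separate NUMA
 * node instead of a cache, hot data can be pinned to it explicitly.
 * The node id is assumed; requires Linux + libnuma (link with -lnuma). */
#define FAST_NODE 1

double *alloc_hot(size_t n)
{
    return (double *)numa_alloc_onnode(n * sizeof(double), FAST_NODE);
}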
> Which, if used right, would provide enough of a bandwidth buffer to absorb
> the I/Os, and with the option to route directly to/from the network/MIC it
> would make a 102.4 GB/s CPU memory subsystem reasonable.
>
> The next big problem is going to be feeding the network bandwidth. With that
> type of I/O capability you would need 9+ 4x FDR IB connections per node. And
> you're probably going to want a switchless topology, as otherwise that would
> be a lot of switches.
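
The 9+ figure checks out, for what it's worth: FDR signals at 14.0625 Gb/s per lane with 64/66 encoding, so a 4x port carries about 6.8 GB/s per direction:

#include <stdio.h>

/* Where the "9+ 4x FDR IB connections" figure comes from. */
int main(void)
{
    double lane_gbps = 14.0625 * 64.0 / 66.0;  /* ~13.6 Gb/s data per lane */
    double port_GBps = 4 * lane_gbps / 8.0;    /* 4x port: ~6.8 GB/s/dir */
    double ports     = 64.0 / port_GBps;       /* to feed 64 GB/s per dir */
    printf("4x FDR = %.1f GB/s -> %.1f ports\n", port_GBps, ports); /* ~9.4 */
    return 0;
}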