SP vs DP & performance metrics

Article: Computational Efficiency for CPUs and GPUs in 2012
By: anon (anon.delete@this.anon.com), July 29, 2012 10:09 am
Room: Moderated Discussions
aaron spink (aaronspink.delete@this.notearthlink.net) on July 29, 2012 5:39 am wrote:
> Emil Briggs (me.delete@this.nowherespam.com) on July 28, 2012 6:11 pm wrote:
>
> > There are two places where we are using GPUs. One of them consists of
> > large matrix operations, which are done using the Nvidia cublas library.
> > Those run close to 80% of peak. The tricky part of the work here is keeping
> > the CPUs busy doing something useful while moving the matrices back and
> > forth to the GPU. The other place is some finite difference routines. Still
> > working on this. It's faster than doing it all on the CPUs but not by much.
> > I'm trying to get more overlap between the CPU and GPUs here but this
> > section of the code is not as suitable for that as the first.
> >
>
> 80% is very very good. Esp compared to linpack.
>
> > Agreed. How difficult do you think it would be to implement coherent
> > memory access?
>
> Depends. If you can convince PCI SIG to
> implement a coherent protocol the difficulty shouldn't be that high. Efficiency
> wouldn't be the greatest since the basic protocols for PCI-E are designed around
> large block transfers but it would be doable esp with the move to integrating
> PCI-E on die.
>
> Probably the easiest solution for something like MIC would be
> to integrate a QPI interface in addition to the PCI-E interface. At that point
> it primarily becomes an exercise in setting up the memory map reasonably/sanely.
> The MIC would need both a caching agent and a home coherency agent, but it
> should be possible to do some cut and paste.
>
> What would be more difficult in
> the QPI + PCI-E space is getting maximum advantage out of it. You would ideally
> like to use both the QPI agent and the PCI-E agent for bulk DMA traffic while
> only using the QPI agent for coherent traffic. Using the QPI link for bulk DMA
> would likely take some work with the various DMA engines. For the MIC local
> coherent memory you would likely only make a subset of it available for coherent
> access from the CPU in order to simplify the performance requirements (likely a
> variable window size that is programmable), so that all memory accesses from the
> MIC to its local memory don't have to remote snoop. Though if you have the area
> available, you might be able to get away with an SRAM based directory (basically
> limited capacity coherency from the MIC to CPU, aka evict to make space) as well.
>
>
> For the CPU memory, you would likely be unrestricted assuming that the CPU
> side used some form of directory.
>
> Total DMA bandwidth should be at least
> equal to 32x PCI-E. And for coherent access you are looking at a minimum of 16x
> PCI-E bandwidth.
>
> And from a practical standpoint you are going to want 2xQPI
> or QPI+PCI-E since it is unlikely that the market requirement will be there for
> the CPUs to have 3x 16x PCI-E. Though if your network interface chip runs over
> a single QPI link it might be viable, but I kinda see the ideal setup for a top
> end super as 1 QPI + 16x PCI-E to both the MIC/GPU and to the network interface.
> So you would be looking at ~32+GB/s (at current speeds, likely 64+ GB/s in the
> 2015 timeframe based on PCI-E 4.0 announced goals) in and out of the CPU to both
> network and MIC for a total of 128 GB/s which means that memory bandwidth likely
> becomes your main bottleneck.
>
> Also let's not forget that by the time this
> happens we are likely going to see some form of stacked memory in reasonably
> wide use, which means that the CPUs will likely have 1-4 GB of ultra high
> bandwidth "cache".

It does not have to be a cache of main memory; the stacked DRAM could be (part of) main memory itself. Why not?


> Which if used right would provide enough bandwidth buffer to
> have the I/Os plus having the option to direct route to/from network/MIC would
> make the 102.4 GB/s CPU memory subsystem reasonable.
>
> The next big problem is
> going to be feeding the network bandwidth. With that type of IO capability you
> would need 9+ 4x FDR IB connections per node. And you're probably going to want
> a switchless topology, because that would be a lot of switches.
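
The I/O figures quoted above can be roughly reproduced with some back-of-the-envelope arithmetic. This is only a sketch: the per-link rates (about 16 GB/s per direction for PCI-E 3.0 x16 and for one QPI link at 8 GT/s, about 6.8 GB/s per direction for a 4x FDR IB link) are assumptions, not numbers from the quoted post, and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the I/O figures in the quoted post.
# All per-link rates are approximate, per direction, and are assumptions
# (they are not stated in the post itself).
PCIE3_X16_GBS = 16.0   # PCI-E 3.0 x16, roughly 16 GB/s per direction
QPI_GBS = 16.0         # one QPI link at 8 GT/s, roughly 16 GB/s per direction
FDR_4X_GBS = 6.8       # 4x FDR InfiniBand, roughly 54.5 Gbit/s of data per direction
CPU_MEM_GBS = 102.4    # CPU memory bandwidth figure taken from the quoted post


def per_device_gbs(qpi=QPI_GBS, pcie=PCIE3_X16_GBS):
    """One QPI link plus one x16 PCI-E link to a single device, one direction."""
    return qpi + pcie


def total_io_gbs(devices=2, directions=2):
    """Traffic to both the MIC/GPU and the network interface, in and out."""
    return per_device_gbs() * devices * directions


def fdr_links_needed(target_gbs):
    """How many 4x FDR links it takes to carry target_gbs in one direction."""
    return target_gbs / FDR_4X_GBS


if __name__ == "__main__":
    print("per device, per direction (GB/s):", per_device_gbs())
    print("total in and out of the CPU (GB/s):", total_io_gbs())
    print("exceeds CPU memory bandwidth:", total_io_gbs() > CPU_MEM_GBS)
    print("4x FDR links for 64 GB/s:", round(fdr_links_needed(64.0), 1))
```

Under those assumptions the total I/O comes out above the 102.4 GB/s memory figure, which is consistent with the point that memory bandwidth becomes the bottleneck, and the "9+" FDR links correspond to the 64 GB/s per direction projected for the 2015 timeframe.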