SP vs DP & performance metrics

Article: Computational Efficiency for CPUs and GPUs in 2012
By: anon (anon.delete@this.anon.com), July 29, 2012 11:09 am
Room: Moderated Discussions
aaron spink (aaronspink.delete@this.notearthlink.net) on July 29, 2012 5:39 am wrote:
> Emil Briggs (me.delete@this.nowherespam.com) on July 28, 2012 6:11 pm wrote:
>
> > There are two places where we are using GPU's. One of them consists of large
> > matrix operations and are done using the Nvidia cublas library. Those run
> > close to 80% of peak. The tricky part of the work here is keeping the CPU's
> > busy doing something useful while moving the matrices back and forth to the
> > GPU. The other place is some finite difference routines. Still working on
> > this. It's faster than doing it all on the CPU's but not by much. I'm trying
> > to get more overlap between the CPU and GPU's here but this section of the
> > code is not as suitable for that as the first.
> >
>
> 80% is very very good. Esp compared to linpack.
>
> > Agreed. How difficult do you think it would be to implement coherent memory
> > access?
> >
>
> Depends. If you can convince PCI SIG to
> implement a coherent protocol the difficulty shouldn't be that high. Efficiency
> wouldn't be the greatest since the basic protocols for PCI-E are designed around
> large block transfers but it would be doable esp with the move to integrating
> PCI-E on die.
>
> Probably the easiest solution for something like MIC would be
> to integrate a QPI interface in addition to the PCI-E interface. At that point
> it primarily becomes an exercise in setting up the memory map reasonably/sanely.
> The MIC would need both a caching agent and a home coherency agent, but it
> should be possible to do some cut and paste.
>
> What would be more difficult in
> the QPI + PCI-E space is getting maximum advantage out of it. You would ideally
> like to use both the QPI agent and the PCI-E agent for bulk DMA traffic while
> only using the QPI agent for coherent traffic. Using the QPI link for bulk DMA
> would likely take some work with the various DMA engines. For the MIC local
> coherent memory you would likely only make a subset of it available for coherent
> access from the CPU in order to simplify the performance requirements (likely a
> variable window size that is programmable) so that all memory accesses from the
> MIC to its local memory don't have to remote snoop though if you have the area
> available, you might be able to get away with an SRAM based directory (basically
> limited capacity coherency from the MIC to CPU aka evict to make space) as well.
>
>
> For the CPU memory, you would likely be unrestricted assuming that the CPU
> side used some form of directory.
>
> Total DMA bandwidth should be at least
> equal to 32x PCI-E. And for coherent access you are looking at a minimum of 16x
> PCI-E bandwidth.
>
> And from a practical standpoint you are going to want 2xQPI
> or QPI+PCI-E since it is unlikely that the market requirement will be there for
> the CPUs to have 3x 16x PCI-E. Though if your network interface chip runs over
> a single QPI link it might be viable, but I kinda see the ideal setup for a top
> end super as 1 QPI + 16x PCI-E to both the MIC/GPU and to the network interface.
> So you would be looking at ~32+GB/s (at current speeds, likely 64+ GB/s in the
> 2015 timeframe based on PCI-E 4.0 announced goals) in and out of the CPU to both
> network and MIC for a total of 128 GB/s which means that memory bandwidth likely
> becomes your main bottleneck.
>
> Also let's not forget that by the time this
> happens we are likely going to see some form of stacked memory in reasonably
> wide use, which means that the CPUs will likely have 1-4 GB of ultra high
> bandwidth "cache".

It does not have to be a cache of main memory; the stacked DRAM could be (part of) main memory itself. Why not?


> Which if used right would provide enough bandwidth buffer to
> have the I/Os plus having the option to direct route to/from network/MIC would
> make the 102.4 GB/s CPU memory subsystem reasonable.
>
> The next big problem is
> going to be feeding the network bandwidth. With that type of IO capability you
> would need 9+ 4x FDR IB connections per node. And you're probably going to want
> a switchless topology, because that would be a lot of switches.
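
For anyone who wants to sanity-check the bandwidth figures in the quoted post, here is a rough Python sketch. The per-link rates are assumptions, not numbers from the post: one QPI link at 8 GT/s taken as 16 GB/s per direction, PCI-E 3.0 x16 as about 15.75 GB/s per direction, and a 4x FDR IB link as 4 lanes at 14.0625 Gbit/s with 64b/66b encoding. The function names are made up for illustration. Only the 32, 128 and 102.4 GB/s figures and the "9+" link count come from the post itself.

```python
# Back-of-the-envelope check of the I/O bandwidth numbers in the quoted post.
# All link rates below are assumed values, per direction, in GB/s.

QPI_GBPS = 16.0                              # assumed: 8 GT/s x 2 bytes per transfer
PCIE3_X16_GBPS = 15.75                       # assumed: PCI-E 3.0 x16 after 128b/130b encoding
FDR_4X_GBPS = 4 * 14.0625 * (64 / 66) / 8    # assumed: 4 lanes, 64b/66b encoding, bits -> bytes
CPU_MEMORY_GBPS = 102.4                      # memory subsystem figure quoted in the post


def per_device_rate():
    """One QPI link plus one x16 PCI-E link to a single device, per direction."""
    return QPI_GBPS + PCIE3_X16_GBPS


def total_io(per_direction, devices=2, directions=2):
    """Aggregate traffic: in and out, to both the MIC/GPU and the network."""
    return per_direction * devices * directions


def fdr_links_needed(target_gbps):
    """Number of 4x FDR links needed to carry target_gbps in one direction."""
    return target_gbps / FDR_4X_GBPS


if __name__ == "__main__":
    print("per device, per direction:", per_device_rate(), "GB/s")
    print("total I/O at 32 GB/s:", total_io(32.0), "GB/s")
    print("exceeds memory bandwidth:", total_io(32.0) > CPU_MEMORY_GBPS)
    print("4x FDR links for 32 GB/s:", round(fdr_links_needed(32.0), 1))
    print("4x FDR links for 64 GB/s:", round(fdr_links_needed(64.0), 1))
```

Under these assumptions QPI plus x16 PCI-E lands just under 32 GB/s per direction, and the 128 GB/s total is larger than the 102.4 GB/s memory figure, which fits the point about memory becoming the bottleneck. The "9+" link count is reproduced if you feed 64 GB/s per direction into the FDR calculation. Whether that is the exact reasoning behind the quoted number is a guess.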