It is difficult to runtime optimize away the difference between a CPU and GPU

Article: Introduction to OpenCL
By: Mark Roulo (nothanks.delete@this.xxx.com), December 10, 2010 7:50 pm
Room: Moderated Discussions
ltcommander.data (ltcommander.tuvok@gmail.com) on 12/10/10 wrote:
---------------------------
>Bryan Catanzaro (name@gmail.com) on 12/10/10 wrote:
>---------------------------
>>You mention in the article that vectorizing work-items is good for AMD, but not
>>so important for Nvidia. I would actually state this more strongly: vectorized
>>work-items often incur significant performance slowdown on Nvidia chips. This happens
>>because vectorization increases the register file working set size for each work-item,
>>and Nvidia's architecture assumes the register file requirements of each work-item are small.
>>
>>For example, I've seen 3x performance slowdown using vectorized OpenCL code which
>>assumes a 4-wide work-item, while running on Nvidia chips.
>>
>>More broadly, OpenCL performance portability is a very difficult task. I prefer
>>to think of OpenCL as providing a unified set of abstractions to program a variety
>>of parallel architectures, but in practice I think getting good performance requires
>>recoding for each architecture you're targeting. In other words, OpenCL solves
>>the problem of each vendor having a proprietary set of abstractions to target proprietary
>>hardware (ie, CUDA/Brook+, etc.), but it doesn't address the performance portability problem at all.
>
>Can any of this be addressed by LLVM? I thought that was supposed to allow for
>machine specific optimization at runtime?

It will be very difficult.

Imagine that you want to do a sequence of things to some data. Say operations A, B, C, D, and E. Assume that these operations are moderately complex (apply some sort of local filter, for example) and not just 'add.'

For some HPC configurations (like GPUs) that tend to have high bandwidth but small cache/shared-memory configurations, you would want to stream data through multiple times and do one operation each time through. As an example, on pre-Fermi nVidia GPUs you would have 16 KB of shared memory per streaming-multiprocessor/core, but you'd want several blocks to be running on each SM/core. The upshot is that you'd like to keep your working set in the 4-8KB range. Things are better on Fermi, but the point still holds.
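To put rough numbers on that working-set figure, here is a small sketch. It assumes "several" means 2 to 4 resident blocks per SM and that the 16 KB is split evenly between them; both are assumptions made for illustration, and the function name is made up.

```python
SHARED_MEM_PER_SM_KB = 16  # pre-Fermi figure quoted in the post


def working_set_budget_kb(resident_blocks):
    """Shared memory left for each block if the SM's 16 KB is split evenly.

    Hypothetical helper; the even split is an assumption.
    """
    return SHARED_MEM_PER_SM_KB / resident_blocks


for blocks in (2, 3, 4):
    budget = working_set_budget_kb(blocks)
    print(blocks, "blocks per SM ->", round(budget, 1), "KB per block")
```

With 2 to 4 blocks resident, each block gets between 8 KB and 4 KB, which is where the 4-8KB range comes from.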

For other HPC configurations (like CPUs) with larger caches and less bandwidth, you want to stream the data through *once* and perform A through E on each piece/tile/region as it passes. Here you have an L1 D-cache of 32KB or 64KB and an L2 cache in the 256KB range.

So, a scheme that minimizes bandwidth but requires a working set in the 128KB range is best for a CPU, while a scheme that needs a lot more bandwidth but keeps the working set down to 8-24KB is best for the GPU.
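A back-of-the-envelope sketch of the bandwidth side of that trade. It assumes every pass reads and writes the whole data set off-chip exactly once, and the 64 MB data set size is a made-up figure for illustration.

```python
def traffic_bytes(n_bytes, passes):
    """Off-chip bytes moved if each pass reads and writes the full data set once.

    Hypothetical model: ignores caching between passes and any halo re-reads.
    """
    return passes * 2 * n_bytes


N = 64 * 1024 * 1024  # hypothetical 64 MB data set

multi = traffic_bytes(N, passes=5)   # one pass per operation, A through E
single = traffic_bytes(N, passes=1)  # all five operations in a single pass

print(multi // single)  # prints 5
```

Under these assumptions the one-operation-per-pass scheme moves five times the data, which only pays off when bandwidth is plentiful and on-chip storage is scarce.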

You wind up writing *very* different loops (and a different number of loops) to get optimal performance for the two systems.
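The two loop shapes can be sketched roughly as below. This is only an illustration: the five functions are hypothetical pointwise stand-ins for A through E (the operations discussed here are local filters, which would also need overlap between neighbouring tiles, and that is not shown), and the tile size is arbitrary.

```python
# Hypothetical stand-ins for operations A..E. Real local filters would read
# neighbouring elements; pointwise functions keep the sketch short.
def op_a(x): return x + 1.0
def op_b(x): return x * 2.0
def op_c(x): return x - 3.0
def op_d(x): return x * x
def op_e(x): return x / 4.0

OPS = [op_a, op_b, op_c, op_d, op_e]


def multipass(data):
    """GPU-style: five passes over the whole data set, one operation per pass."""
    buf = list(data)
    for op in OPS:                  # outer loop over operations
        buf = [op(x) for x in buf]  # inner loop streams all the data
    return buf


def fused_tiled(data, tile=4):
    """CPU-style: one pass; each tile gets A..E applied while it is resident."""
    out = []
    for start in range(0, len(data), tile):  # outer loop over tiles
        block = data[start:start + tile]
        for op in OPS:                       # inner loop over operations
            block = [op(x) for x in block]
        out.extend(block)
    return out


data = [float(i) for i in range(10)]
print(multipass(data) == fused_tiled(data))  # same results, different loop nest
```

The results match, but the loop nest is inverted: operations on the outside in one version, tiles on the outside in the other. That inversion is the kind of restructuring a runtime would have to discover on its own.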

Asking a runtime to perform this sort of optimization is asking a LOT.