By: Mark Roulo (nothanks.delete@this.xxx.com), August 16, 2011 6:47 am
Room: Moderated Discussions
David Kanter (dkanter@realworldtech.com) on 8/15/11 wrote:
---------------------------
>Mark Roulo (nothanks@xxx.com) on 8/15/11 wrote:
>---------------------------
>>David Kanter (dkanter@realworldtech.com) on 8/10/11 wrote:
>>---------------------------
>>>AFAIK, Intel has not exposed any shared memory to SW, which is required for OpenCL.
>>>They could use the L3 cache for shared memory, but the performance seems like it
>>>would be pretty awful due to high latency.
>>
>>Shared memory is a logical, not physical, concept in a cache-coherent system.
>>The L1 would probably wind up being used for typical OpenCL codes (the mapping being one nVidia SM -> 1 x86 core).
>
>I was speaking of the Sandy Bridge GPU.
Oh! Whoops.
>>>I also wonder about numerical accuracy.
>>
>>Do we expect x86 numerics to be *WORSE* than GPU numerics?
>
>No, but I suspect the SNB GPU may have worse numerics than Nvidia/AMD GPUs.
This would not surprise me, either.
>Agreed. Although I think Matt Pharr and the other folks in ART are trying to give
>you *similar* performance, without using intrinsics.
Folks are trying. I've spent time in the last few years dealing with nVidia/CUDA claims. We've coded for nVidia/CUDA and ATI/OpenCL, and evaluated RapidMind (now Intel Array Building Blocks) and Intel Threading Building Blocks on x86.
None of them comes anywhere close to its performance claims on our workloads (which, I grant, are not single-precision floating point ...).
We couldn't even get ArBB to beat ICC compiling *scalar* code with auto-vectorization turned on, although ArBB beat GCC by a few tens of percent. RapidMind failed just as badly a few years ago when we evaluated it and asked *them* to write the RapidMind code for us.
So ... I'm skeptical about OpenCL on x86 beating hand-coded intrinsics in general any time soon.