By: David Kanter (dkanter.delete@this.realworldtech.com), August 15, 2011 6:24 pm
Room: Moderated Discussions
Mark Roulo (nothanks@xxx.com) on 8/15/11 wrote:
---------------------------
>David Kanter (dkanter@realworldtech.com) on 8/10/11 wrote:
>---------------------------
>>AFAIK, Intel has not exposed any shared memory to SW, which is required for OpenCL.
>>They could use the L3 cache for shared memory, but the performance seems like it
>>would be pretty awful due to high latency.
>
>Shared memory is a logical, not physical, concept in a cache-coherent system.
>The L1 would probably wind up being used for typical OpenCL codes (the mapping being one nVidia SM -> 1 x86 core).
I was speaking of the Sandy Bridge GPU. The actual CPU cores should have no problems running OpenCL - as you pointed out, the L1 cache is sufficiently sized, and you may not even need it. Each SIMD lane can communicate with the others fairly easily, since the registers are shared.
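To make the CPU case concrete, here is a rough sketch in plain C of one work-group doing a sum reduction. It is only an illustration under my own assumptions, not how any shipping OpenCL runtime is implemented: `WG_SIZE`, `local_mem` and `workgroup_sum` are made-up names, and the array standing in for `__local` memory is just ordinary cacheable memory that will normally sit in L1.

```c
#include <assert.h>
#include <stddef.h>

#define WG_SIZE 64  /* hypothetical work-group size, chosen for illustration */

/* Emulates one OpenCL work-group doing a tree reduction on a CPU core.
 * "local_mem" plays the role of __local memory: an ordinary array that
 * will normally stay resident in the core's L1 data cache. */
float workgroup_sum(const float *in, size_t n)
{
    float local_mem[WG_SIZE];
    assert(n <= WG_SIZE);

    /* Each "work-item" copies one element into local memory (zero-padded). */
    for (size_t lid = 0; lid < WG_SIZE; ++lid)
        local_mem[lid] = (lid < n) ? in[lid] : 0.0f;

    /* Tree reduction. Finishing the inner loop before moving to the next
     * stride stands in for barrier(CLK_LOCAL_MEM_FENCE) in a real kernel. */
    for (size_t stride = WG_SIZE / 2; stride > 0; stride /= 2)
        for (size_t lid = 0; lid < stride; ++lid)
            local_mem[lid] += local_mem[lid + stride];

    return local_mem[0];
}
```

The inner loop over `lid` is the part a compiler or runtime could map onto SIMD lanes. Nothing here needs a dedicated scratchpad, which is the sense in which shared memory is a logical concept on a cache-coherent core.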
>>I also wonder about numerical accuracy.
>
>Do we expect x86 numerics to be *WORSE* than GPU numerics?
No, but I suspect the SNB GPU may have worse numerics than Nvidia/AMD GPUs.
>>It's also possible that OpenCL is feasible, but has such abhorrent performance
>>that they judged it better to simply wait for the next generation.
>
>My expectation is that OpenCL performance on x86 is a *long* way from being performance
>competitive with hand coded SSE/AVX intrinsics. Intel's marketing question would
>be, "Is there a large enough market for performance willing to code in OpenCL, but unwilling to use intrinsics?"
Agreed. Although I think Matt Pharr and the other folks in ART are trying to give you *similar* performance, without using intrinsics.
David