Article: Parallelism at HotPar 2010
By: Steve Underwood (steveu.delete@this.coppice.org), August 23, 2010 2:25 am
Room: Moderated Discussions
Anon (no@thanks.com) on 8/22/10 wrote:
---------------------------
>Steve Underwood (steevu@coppice.org) on 8/22/10 wrote:
>---------------------------
>>hobold (hobold@vectorizer.org) on 8/19/10 wrote:
>>---------------------------
>>>Steve Underwood (steveu@coppice.org) on 8/18/10 wrote:
>>>---------------------------
>>>[...]
>>>>It took a lot of time for people to get the best out of hand shuffling
>>>>things with SSSE3, and the next generation core made this complexity something that
>>>>needs to be ripped out of the code. AAAHHHHH!
>>>>
>>>
>>>http://www.khronos.org/developers/library/2010_siggraph_bof_opencl/OpenCL-BOF-Intel-SIGGRAPH-Jul10.pdf
>
>This starts to show where NVidia (for all their faults) has been pushing hard -
>they already have reasonable OpenCL and other compilers targeting deep/wide systems.
>
>Intel are going there also, which is of course a great thing.
>
>It does however look to me like Intel is playing the same game people seem incensed
>at NVidia for doing here: unless I have missed something, their base C implementation
>is running single threaded on a single core, versus 4/8 SIMD units for their OpenCL version..
>In some ways I agree with that view as part of the 'job' is parallelising the implementation,
>and OpenCL makes that easier (than C) (once you get around to understanding it..).
>However, people jumped all over that view with the NVidia comparisons.. so I guess it applies here also.
>But again, it is really non-optimised C versus an optimised SSE/threaded implementation.
These are two different things.
Intel are showing the capabilities of OpenCL: that you can take a naive single-threaded C implementation and drive it up to nearly the hand-tuned performance of the hardware, in a quick-to-code and fairly portable way. One problem doesn't really illustrate these capabilities very well. A selection of diverse number-crunching problems would give a more meaningful picture, but the Intel document is fine as far as it goes.
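To make that concrete, here is a minimal sketch of the sort of thing I mean, assuming a stock OpenCL 1.x CPU driver. The saxpy loop and all the names are just mine for illustration - this is not the problem from the Intel slides. The point is how little the source changes between the naive single-threaded C loop and the data-parallel kernel that the runtime is then free to spread across cores and SSE lanes:

#include <stdio.h>
#include <CL/cl.h>

/* The naive single-threaded C baseline. */
static void saxpy_c(int n, float a, const float *x, float *y)
{
    int i;

    for (i = 0;  i < n;  i++)
        y[i] = a*x[i] + y[i];
}

/* The same operation as an OpenCL kernel - one work-item per element.
   How work-items get packed into SIMD lanes and threads is the runtime's
   problem, not the programmer's. */
static const char *saxpy_cl_src =
    "__kernel void saxpy(float a, __global const float *x, __global float *y)\n"
    "{\n"
    "    int i = get_global_id(0);\n"
    "    y[i] = a*x[i] + y[i];\n"
    "}\n";

int main(void)
{
    enum { N = 1 << 20 };
    static float x[N];
    static float y[N];
    static float yref[N];
    float a = 2.0f;
    size_t global = N;
    int i;

    for (i = 0;  i < N;  i++)
    {
        x[i] = (float) i;
        y[i] = 1.0f;
        yref[i] = 1.0f;
    }

    /* Boilerplate: grab a CPU device and build the kernel. (All error
       checking and the clRelease*() cleanup omitted for brevity.) */
    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_CPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &saxpy_cl_src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "saxpy", NULL);

    cl_mem xb = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(x), x, NULL);
    cl_mem yb = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(y), y, NULL);
    clSetKernelArg(k, 0, sizeof(float), &a);
    clSetKernelArg(k, 1, sizeof(cl_mem), &xb);
    clSetKernelArg(k, 2, sizeof(cl_mem), &yb);

    /* Run N work-items and read the result back. */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, yb, CL_TRUE, 0, sizeof(y), y, 0, NULL, NULL);

    /* Compare against the naive C version. */
    saxpy_c(N, a, x, yref);
    printf("OpenCL %f, C %f\n", y[1234], yref[1234]);
    return 0;
}

The interesting bit is that the kernel body is essentially the original loop body; the threading and vectorisation decisions have moved into the runtime, which is exactly what the Intel slides are selling.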
People complain about the nVidia documents because they try to imply the GPU is capable of X times the speed of the CPU (for huge values of X). However, they only use a fraction of the CPU's capabilities, or they use an ancient CPU (sometimes both). If you look through a number of these comparison documents, they follow the standard pattern of work produced for hire by academics. The first couple of pages show some exciting figures, to please the sponsor. Further in, you find the detailed material making more meaningful comparisons. The academics must include that, to maintain credibility and be able to face themselves in the mirror each day. I seldom have any complaints about the detailed material. If you read it properly, it usually implies a worthwhile speedup for a problem that fits the GPU well. It's never the spectacular improvement from the cover page, though.
It will be interesting to see how much the latest, more generalised GPUs widen the range of problems that speed up significantly. It seems they should be appearing about now. If they start showing worthwhile gains for problems involving numerous short vectors (e.g. the endless 40-element vectors in the G.729 speech codec), I might start to get really interested in them.
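To be clear about what I mean by short vectors, the hot loops in that sort of codec look like the toy below (my own made-up function, not the G.729 reference code): a 40-sample subframe processed one tiny dot product or correlation at a time. With today's GPUs, the fixed cost of shipping each of these to the card swamps the arithmetic unless you can batch a lot of them together.

#include <stdint.h>
#include <stdio.h>

#define SUBFRAME_LEN 40     /* G.729 works on 40-sample (5ms) subframes */

/* A 40-element dot product on 16-bit samples - the kind of tiny kernel a
   fixed-point speech codec calls over and over. (Real codec code uses
   saturating arithmetic; that is ignored here.) */
static int32_t dot40(const int16_t x[SUBFRAME_LEN], const int16_t y[SUBFRAME_LEN])
{
    int32_t acc = 0;
    int i;

    for (i = 0;  i < SUBFRAME_LEN;  i++)
        acc += (int32_t) x[i]*y[i];
    return acc;
}

int main(void)
{
    int16_t a[SUBFRAME_LEN];
    int16_t b[SUBFRAME_LEN];
    int i;

    for (i = 0;  i < SUBFRAME_LEN;  i++)
    {
        a[i] = (int16_t) i;
        b[i] = 1;
    }
    /* 0 + 1 + ... + 39 = 780 */
    printf("%d\n", (int) dot40(a, b));
    return 0;
}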
Steve