Article: Parallelism at HotPar 2010
By: anon (anon.delete@this.anon.com), July 28, 2010 5:10 pm
Room: Moderated Discussions
Anon (no@email.com) on 7/28/10 wrote:
---------------------------
>David Kanter (dkanter@realworldtech.com) on 7/28/10 wrote:
>---------------------------
>>Anon (no@email.com) on 7/27/10 wrote:
>>---------------------------
>>>David Kanter (dkanter@realworldtech.com) on 7/27/10 wrote:
>>>---------------------------
>>>>Today is shaping up to be an excellent day for a number of reasons:
>>>>
>>>>We have an excellent new contributor, Tarek Chammah. Tarek is a graduate student
>>>>at the University of Waterloo who specializes in software approaches to parallelization,
>>>>including run-times, languages, APIs, etc.
>>>>
>>>>I recently had the opportunity to go to lunch with Tarek and we had an excellent
>>>>time and I learned quite a lot about the trends on the software side of the equation.
>>>>One of the points that Tarek emphatically made is that with the emergence of parallel
>>>>processing, software is becoming just as important as the hardware. Times are
>>>>a-changing, and it's not just about your good old compiler getting code into shape
>>>>for the hardware; software is truly an essential part of the glue that binds together
>>>>the system, and I hope to be able to discuss software more in the future at RWT.
>>>>
>>>>Second, Tarek has provided us with an excellent article covering some of the highlights
>>>>of the HotPar 2010 workshop. HotPar was held in Berkeley this year, and included
>>>>a fair number of papers - but almost all of them were software-focused. This is
>>>>a nice change of pace from our usual coverage of ISSCC, Hot Chips, or IEDM:
>>>>
>>>>http://www.realworldtech.com/page.cfm?ArticleID=RWT072610001641
>>>>
>>>>Please join me in thanking Tarek for his contribution, and I look forward to some lively discussions.
>>>>
>>>>
>>>>David
>>>
>>>I would certainly agree that more input on this subject is most welcome.
>>>
>>>I have two discussion questions about the section "The Limits of GPUs".
>>>
>>>Firstly, a lot of this content seems to run along the lines of 'untuned Intel xxx
>>>code was a lot slower than the GPU, but then we spent a lot of time rewriting/tuning
>>>the CPU code, and it got faster!' However, no mention seems to be made of similar tuning efforts in the GPU code.
>>
>>Well, I expect that for most of the GPU vs. CPU comparisons made by NV, there was plenty
>>of GPU tuning. I trust their marketing department to do a good job. Ditto for ATI, if they were in that game.
>
>We are not looking here at figures of an NVidia-tuned implementation versus another
>tuned implementation, though, are we - there is mention of strong tuning efforts on
>the CPU side, and no mention on the GPU side; that is all I am raising. Please do
>not take it that I believe or support all the NVidia hype.
>
>>>I think most of the people involved in GPU CUDA programming will agree that it
>>>is significantly HARDER to extract full potential from GPU code, although for 'suitable'
>>>codes the gains are even larger - this looks/feels like one of these cases: highly
>>>tuned CPU code versus basic GPU code.
>>
>>I'm not really sure how true that is. If you look at some of the presentations
>>out there, it's clear that even on Nehalem - which is one of the most well-rounded
>>CPUs out there - you can see a 25X improvement from tuning.
>>
>>That's a pretty big factor.
>>
>>How much variation is there once you've written an algorithm in CUDA?
>
>I have seen factors of well over a hundred from restructurings of the same code to
>more closely match the GPU's needs (often to do with memory access patterns). As
>you say, Nehalem is a well-rounded CPU; I would not consider any current GPU nearly
>as well rounded - they are highly optimisation-sensitive, and the optimisations
>required are more difficult, not less (often due to the lack of tools for suitably detailed profiling, etc.).
>
>>NV marketing keeps on saying how CUDA makes programming much easier, yet it sounds
>>like you are saying it really isn't good enough.
>
>Do they? That must be something I keep missing - they like to blow their trumpet
>about good outcomes, over-generalise them, and talk about peak rates too
>much (as everyone does), but do they really say it's easy?
>
>It has certainly become a lot EASIER, thanks to CUDA and OpenCL; however, I don't think
>anyone would call it easy - just look at NVidia's own examples. They are rarely simple,
>even though they often deal with quite trivial codes.
>
>>Also, what about ATI GPUs? I expect a lot more variability there.
>
>So do I, and that's what I find when I test them; however, due to the mountain of
>software issues they carry, that does not happen often (something that is improving, but very, very slowly).
>
>>>It is also interesting (and the full information is not presented) that in the
>>>second group of cases, we seem to be comparing DP codes on a Tesla C1060, rather than the current C2070.
>>>Now, a C1060 has around an 8:1 SP:DP ratio. The C2070 is closer to 2:1, and has nearly
>>>7 TIMES the peak DP of the C1060 in a single GPU... I do not claim that the Nehalem
>>>system is the fastest current one either, however I doubt a system 7 times faster could be found.
>>>Secondly, in this case the codes being looked at are, as NVidia appears to have
>>>pointed out, not really prime targets of their systems anyway (and yet their OLD systems do pretty well).
>>>
>>>Now, these could be seen as valid 'limits' of GPUs:
>>>1 - Older implementations (and some current ones) are not great at DP.
>>>2 - GPUs are very optimisation-sensitive (the tools are quite new, and they are not that flexible as compute devices).
>>>3 - GPU performance varies strongly; not all target applications are suitable.
>>
>>I think the overall moral of the story is that if you see a performance gap of
>>4-5X between a CPU and GPU, you should look closely at the code (and the hardware
>>too). GPUs are fast on the right problems, but they should not be 10X faster -
>>especially on bandwidth-bound problems, where the gap narrows considerably.
>
>Are you missing the fact that the C2070 has 7 TIMES the DP capability of the C1060
>that they used? The moral of that story seems to be selective choice of benchmarks - or is it a historical piece?
>
>GPUs can well be 10X faster, or more in some cases, but anyone who tries to over-generalise
>that is most likely being foolish, drinking the kool-aid, or does not understand.
>
>In fact, for certain interesting cases, where their more specialised hardware can
>be used, they can be well over 10X faster, however these cases are quite specific.
I'm not doubting you - I don't know much about the topic and you seem to know something. So can you point to some results, please? By results I mean an instance where _useful_ work is being done and the GPU is 10X faster than a good parallel and SIMD implementation on a Nehalem.
An "interesting" case where it is well over 10X faster would be interesting as well, but more important is the 10X case that is doing useful work.
Thanks in advance!
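For anyone following the memory-access-pattern point made upthread, here is a minimal sketch of the kind of restructuring being described: rewriting a kernel so that adjacent threads in a warp touch adjacent addresses (coalesced access). This is a generic, hypothetical illustration - the kernel names, shapes, and launch parameters are invented, not taken from any of the benchmarks discussed:

```cuda
#include <cuda_runtime.h>

// Strided layout: thread t reads elements t*stride .. t*stride+stride-1.
// Within a warp, consecutive threads touch addresses `stride` floats apart,
// so each warp's loads scatter into many separate memory transactions.
__global__ void sum_strided(const float* in, float* out, int stride, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int i = 0; i < stride; ++i) {
        int idx = t * stride + i;
        if (idx < n) acc += in[idx];
    }
    out[t] = acc;  // one partial sum per thread
}

// Coalesced layout: thread t reads elements t, t+T, t+2T, ... where T is
// the total thread count (a grid-stride loop). On every iteration,
// consecutive threads in a warp read consecutive addresses, so the loads
// collapse into a few wide transactions.
__global__ void sum_coalesced(const float* in, float* out, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int T = gridDim.x * blockDim.x;
    float acc = 0.0f;
    for (int i = t; i < n; i += T)
        acc += in[i];
    out[t] = acc;  // one partial sum per thread
}
```

Both kernels read the same data and produce one partial sum per thread (e.g. launched as `sum_coalesced<<<blocks, 256>>>(d_in, d_out, n)` with `out` sized to the total thread count); only the assignment of elements to threads differs. On typical CUDA hardware the coalesced version is dramatically faster for large strides, which is the sort of restructuring-driven gap described above.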