Article: Parallelism at HotPar 2010
By: David Kanter (dkanter.delete@this.realworldtech.com), July 27, 2010 11:48 pm
Room: Moderated Discussions
Anon (no@email.com) on 7/27/10 wrote:
---------------------------
>David Kanter (dkanter@realworldtech.com) on 7/27/10 wrote:
>---------------------------
>>Today is shaping up to be an excellent day for a number of reasons:
>>
>>We have an excellent new contributor, Tarek Chammah. Tarek is a graduate student
>>at the University of Waterloo who specializes in software approaches to parallelization,
>>including run-times, languages, APIs, etc.
>>
>>I recently had the opportunity to go to lunch with Tarek and we had an excellent
>>time and I learned quite a lot about the trends on the software side of the equation.
>>One of the points that Tarek emphatically made is that with the emergence of parallel
>>processing, the software is becoming equally important as the hardware. Times are
>>a changing, and it's not just about your good old compiler getting code into shape
>>for the hardware; software is truly an essential part of the glue that binds together
>>the system, and I hope to be able to discuss software more in the future at RWT.
>>
>>Second, Tarek has provided us with an excellent article covering some of the highlights
>>of the HotPar 2010 workshop. Hot Par was held in Berkeley this year, and included
>>a fair number of papers - but almost all of them were software focused. This is
>>a nice change of pace from our usual coverage of ISSCC, Hot Chips or IEDM:
>>
>>http://www.realworldtech.com/page.cfm?ArticleID=RWT072610001641
>>
>>Please join me in thanking Tarek for his contribution, and I look forward to some lively discussions.
>>
>>
>>David
>
>I would most certainly agree that more input on this subject is most welcome.
>
>I have two discussion questions for the section "The Limits of GPUs"
>
>Firstly, a lot of this content seems to run along the lines of 'untuned Intel xxx
>code was a lot slower than the GPU, but then we spent a lot of time rewriting/tuning
>the CPU code, and it got faster!' However, no mention seems to be made of similar tuning efforts in GPU code.
Well, I expect that for most of the GPU vs. CPU comparisons made by NV, there was plenty of GPU tuning. I trust their marketing department to do a good job. Ditto for ATI if they were in that game.
>I think most of the people involved in GPU CUDA programming will agree that it
>is significantly HARDER to extract full potential from GPU code, although for 'suitable'
>codes the gains are even larger - this looks/feels like one of these cases... highly
>tuned CPU code versus basic GPU code.
I'm not really sure how true that is. If you look at some of the presentations out there, it's clear that even on Nehalem - which is one of the most well-rounded CPUs out there - you can see a 25X improvement from tuning.
That's a pretty big factor.
How much variation is there once you've written an algorithm in CUDA?
NV marketing keeps on saying how CUDA makes programming much easier, yet it sounds like you are saying it really isn't good enough.
Also, what about ATI GPUs? I expect a lot more variability there.
>It is also interesting (and the full information is not presented) that in the
>second group of cases, we seem to be comparing DP codes on Tesla C1060, rather than the most certainly current 2070.
>Now, a C1060 has around an 8:1 SP:DP ratio. The 2060 closer to 2:1, and nearly
>7 TIMES the peak DP with a single GPU than the 1060... I do not doubt that the Nehalem
>system is not the fastest current either, however I doubt a system 7 times faster could be found.
>Secondly, in this case, the codes being looked at are, as NVidia appears to have
>pointed out, not really prime targets of their systems anyway (and yet their OLD systems do pretty well).
>
>Now, this could be seen as valid 'limits' of GPUs.
>1 - Older implementations (and some current) are not great at DP.
>2 - GPUs are very optimisation sensitive (tools are quite new, and they are not that flexible compute devices)
>3 - GPU performance varies strongly, not all target applications are suitable.
I think the overall moral of the story is that if you see a performance gap of more than 4-5X between a CPU and GPU, you should look closely at the code (and the hardware too). GPUs are fast on the right problems, but they should not be 10X faster, especially on bandwidth-bound problems, where the gap narrows considerably.
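To put rough numbers on the bandwidth-bound point, here is a minimal sketch using a roofline-style bound, which is my choice of illustration and not something taken from the article. The peak and bandwidth figures are hypothetical round numbers, not measurements of any real CPU or GPU, and the function names are made up for this example.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline-style bound: the lower of the compute peak and what
    memory bandwidth can feed at the given arithmetic intensity."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# Hypothetical round numbers chosen for illustration only (assumptions):
# the GPU has 10x the peak compute but only 4x the memory bandwidth.
CPU = {"peak_gflops": 100.0, "bandwidth_gbs": 30.0}
GPU = {"peak_gflops": 1000.0, "bandwidth_gbs": 120.0}


def gpu_over_cpu(flops_per_byte):
    """Ratio of the GPU bound to the CPU bound at one arithmetic intensity."""
    cpu = attainable_gflops(CPU["peak_gflops"], CPU["bandwidth_gbs"], flops_per_byte)
    gpu = attainable_gflops(GPU["peak_gflops"], GPU["bandwidth_gbs"], flops_per_byte)
    return gpu / cpu


if __name__ == "__main__":
    # Low intensity = bandwidth bound; high intensity = compute bound.
    for intensity in (0.25, 1.0, 4.0, 16.0):
        ratio = gpu_over_cpu(intensity)
        print(f"{intensity:5.2f} flops/byte -> GPU/CPU bound ratio {ratio:.1f}x")
```

Under these assumed figures the ratio sits at the bandwidth ratio (4x) for low-intensity kernels and only climbs toward the peak-compute ratio (10x) once both machines are compute bound. Real results depend on the actual hardware and how well each side is tuned.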
David