Article: Parallelism at HotPar 2010
By: David Kanter (dkanter.delete@this.realworldtech.com), July 29, 2010 11:38 am
Room: Moderated Discussions
[snip]
>>>I have two discussion questions for the section "The Limits of GPUs"
>>>
>>>Firstly, a lot of this content seems to run along the lines of 'untuned Intel xxx
>>>code was a lot slower than the GPU, but then we spent a lot of time rewriting/tuning
>>>the CPU code, and it got faster!' however no mention seems to be made of similar tuning efforts in GPU code.
>>
>>Well I expect for most of the GPU vs. CPU comparisons made by NV, there was plenty
>>of GPU tuning. I trust their marketing department to do a good job. Ditto for ATI if they were in that game.
>
>We are not looking here at figures of an NVidia tuned implementation versus another
>tuned implementation though, are we? There is mention of strong tuning efforts on
>the CPU side, and no mention on the GPU side; that is all I am raising. Please do
>not take it that I believe or support all the NVidia hype.
If I understand what you are saying correctly, what you are interested in is a maximally tuned GPU vs. maximally tuned CPU comparison, is that right?
>>>I think most of the people involved in GPU CUDA programming will agree that it
>>>is significantly HARDER to extract full potential from GPU code, although for 'suitable'
>>>codes the gains are even larger - this looks/feels like one of these cases.... highly
>>>tuned CPU code versus basic GPU code.
>>
>>I'm not really sure how true that is. If you look at some of the presentations
>>out there, it's clear that even on Nehalem - which is one of the most well rounded
>>CPUs out there, you can see a 25X improvement from tuning.
>>
>>That's a pretty big factor.
>>
>>How much variation is there once you've written an algorithm in CUDA?
>
>I have seen factors of well over a hundred in restructurings of the same code to
>more closely match the GPU's needs (often to do with memory access patterns). As
>you say Nehalem is a well rounded CPU, I would not consider any current GPU nearly
>as well rounded - they are highly optimisation sensitive, and the optimisations
>required are more difficult, not less (often due to the lack of tools for suitably detailed profiling, etc)
Can you give some examples here of the performance improvement due to changes in the coding of an algorithm?
I think one of the issues with GPUs is that proponents of CUDA really try to portray it as an easy development environment. I think it's obvious that CUDA is far ahead of everyone else, but based on what you're saying, 'easy' is not the right adjective at all.
>>NV marketing keeps on saying how CUDA makes programming much easier, yet it sounds
>>like you are saying it really isn't good enough.
>
>Do they? That must be something I keep missing - they like to blow their trumpet
>about good outcomes, and over generalise these, and also talk about peak rates too
>much (as everyone does), but do they really say it's easy?
Yes they do, often. It tends to be in verbal conversation rather than in slides.
>It has certainly become a lot EASIER, due to CUDA and OpenCL, however I don't think
>anyone would call it easy, just look at NVidia's own examples.. they are rarely simple,
>even though they often deal with quite trivial codes..
[snip]
>>>It is also interesting (and the full information is not presented) that in the
>>>second group of cases, we seem to be comparing DP codes on Tesla C1060, rather than the most certainly current 2070.
>>>Now, a C1060 has around an 8:1 SP:DP ratio. The 2060 closer to 2:1, and nearly
>>>7 TIMES the peak DP with a single GPU than the 1060... I do not doubt that the Nehalem
>>>system is not the fastest current either, however I doubt a system 7 times faster could be found.
>>>Secondly in this case, the codes being looked at are, as NVidia appears to have
>>>pointed out, not really prime targets of their systems anyway (and yet their OLD systems do pretty well).
>>>
>>>Now, this could be seen as valid 'limits' of GPUs.
>>>1 - Older implementations (and some current) are not great at DP.
>>>2 - GPUs are very optimisation sensitive (tools are quite new, and they are not that flexible compute devices)
>>>3 - GPU performance varies strongly, not all target applications are suitable.
>>
>>I think the overall moral of the story is that if you see a performance gap of
>>more than 4-5X between a CPU and GPU, you should look closely at the code (and the hardware
>>too). GPUs are fast on the right problems, but they should not be 10X faster.
>>Especially on bandwidth bound problems where the gap narrows considerably.
>
>Are you missing the fact that the 2060 has 7 TIMES the DP capability of the 1060
>that they used? The moral of that story seems to be selective choice of benchmarks, or is it a historical piece?
Not at all.
>GPU can well be 10X faster, or more in some cases, but anyone who tries to over-generalise
>that is most likely being foolish, drinking the koolaid, or does not understand.
The only scenario I can see where a GPU would be more than 10X faster (with both CPU and GPU coded well) would be where there is heavy dependence on operations which are slow on the CPU, but fast on the GPU.
I'm not sure which operations fall into that category, but perhaps some things like divides or square roots may be faster on a GPU.
Even the fastest GPU today has ~170GB/s of memory bandwidth, which is only 4X more than Magny-Cours and 5.5X more than Nehalem-EP.
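To make the bandwidth point concrete, here is a rough back-of-envelope sketch in Python. It only uses the ~170GB/s figure and the 4X and 5.5X ratios quoted above; the CPU bandwidths are derived from those ratios rather than measured, and the function names and the 1GB buffer size are hypothetical choices for illustration. It assumes an idealised kernel whose run time is set purely by memory traffic.

```python
# Back-of-envelope sketch: for a kernel limited purely by memory bandwidth,
# run time ~ bytes moved / bandwidth, so the best-case GPU speedup is just
# the bandwidth ratio. All names here are hypothetical, for illustration only.

GPU_BW_GBS = 170.0  # approximate GPU bandwidth quoted in the post

# Ratios quoted in the post (GPU bandwidth / CPU bandwidth).
QUOTED_RATIOS = {"Magny-Cours": 4.0, "Nehalem-EP": 5.5}


def implied_cpu_bandwidth(gpu_bw_gbs, ratio):
    """CPU bandwidth in GB/s implied by a quoted GPU:CPU ratio (derived, not measured)."""
    return gpu_bw_gbs / ratio


def stream_time_s(n_bytes, bw_gbs):
    """Idealised time in seconds to move n_bytes at bw_gbs GB/s, ignoring latency and caches."""
    return n_bytes / (bw_gbs * 1e9)


def bandwidth_bound_speedup(n_bytes, gpu_bw_gbs, cpu_bw_gbs):
    """Upper bound on GPU speedup when both sides are purely bandwidth bound."""
    return stream_time_s(n_bytes, cpu_bw_gbs) / stream_time_s(n_bytes, gpu_bw_gbs)


if __name__ == "__main__":
    buffer_bytes = 1e9  # assumed 1GB working set, chosen arbitrarily
    for name, ratio in QUOTED_RATIOS.items():
        cpu_bw = implied_cpu_bandwidth(GPU_BW_GBS, ratio)
        ceiling = bandwidth_bound_speedup(buffer_bytes, GPU_BW_GBS, cpu_bw)
        print(f"{name}: ~{cpu_bw:.1f} GB/s implied, speedup ceiling ~{ceiling:.1f}X")
```

Under these assumptions the ceiling is simply the bandwidth ratio, regardless of buffer size, which is why a bandwidth bound code should not show anything close to a 10X gap.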
>In fact, for certain interesting cases, where their more specialised hardware can
>be used, they can be well over 10X faster, however these cases are quite specific.
Yes, I'd agree.
>My queries on their final page of this article were more about the fact that it
>seemed to pick some rather skewed cases which seemed shall we say 'selected for a purpose'.
>It is not difficult to find code that performs terribly on the GPU, very terribly
>in cases.. but equally there is code that performs fantastically, GPUs are just
>not a generalised CPU, and (other than a few one-liners from marketing departments)
>I don't really think anyone claims they are.
I totally agree. GPUs are suitable for very specific sorts of workloads; where they fit, they tend to provide very good performance. However, if you fall outside that region, it's often very ugly.
But here's an example of a rather unrealistic and glowing portrayal of the GPU:
http://www.hpcwire.com/features/Kudos-for-CUDA-97889444.html
"Contrary to the accepted wisdom that GPU computing is more difficult, I believe its success thus far signals that it is no more complicated than good CPU programming."
That contradicts what you said before.
The article also implies that somehow GPU programming (and CUDA specifically) replaces SSE+OpenMP+MPI, which is just a total load of BS. A GPU can avoid the need for SSE, and I think you can make the argument that CUDA is more elegant than AVX/SSE. But the whole point of OpenMP is shared memory communication, which isn't feasible on GPUs - that's not a feature! And you still need MPI for most problems.
Another example is this piece on Forbes:
http://www.forbes.com/2010/04/29/moores-law-computing-processing-opinions-contributors-bill-dally.html
Moore's Law has always been about economic viability of transistor density and integration. It's been conflated by some people to relate to performance, but that's inaccurate. CPU performance scaling is not dead - but single-threaded gains are vastly reduced. Parallel performance on CPUs is still increasing at a good clip. The notion that 'multi-core' is somehow a dead end is specious at best, considering that GPUs are themselves multi-core architectures.
>I must say that I do welcome and enjoy these articles, but a lot of people seem
>to love the 'GPU is no good as it is not a generalised CPU!' strawman, and then
>go selecting poor usage cases to defend that - when it is obvious to anyone who
>applies even a little critical thinking that they are not.
I'm sure some people hold that view. My perspective is more nuanced and probably more in line with yours actually.
I see the GPU as a relatively new platform, one that holds a good deal of promise for certain highly structured and HPC-like workloads that are free from dependencies. It's fundamentally different from a CPU in that it's really a bandwidth optimized device, and there are certain trade-offs that implies which make it unsuitable for many workloads/algorithms.
Ultimately, the right balance is a combination of the CPU and GPU. What isn't clear is where that balance lies. The notion that you don't need a high performance CPU is not particularly credible since even for embarrassingly parallel workloads, there tends to be a fraction which is 'serial', and will limit performance gains.
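The effect of that serial fraction is what Amdahl's law describes. A minimal Python sketch follows; the function name and the serial fractions and accelerator speedups in the loop are hypothetical values picked for illustration, not measurements from any of the workloads discussed here.

```python
# Amdahl's law sketch: overall speedup when only the parallel part of a
# workload is accelerated. The function name and the example numbers below
# are hypothetical, chosen only to illustrate the shape of the curve.


def amdahl_speedup(serial_fraction, parallel_speedup):
    """Overall speedup given the serial fraction of the original run time
    and the speedup applied to the remaining (parallel) part."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if parallel_speedup <= 0.0:
        raise ValueError("parallel_speedup must be positive")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)


if __name__ == "__main__":
    # Assumed serial fractions and accelerator speedups, for illustration only.
    for serial in (0.01, 0.05, 0.10):
        for accel in (10.0, 100.0):
            overall = amdahl_speedup(serial, accel)
            print(f"serial={serial:.0%}, parallel part {accel:.0f}X faster "
                  f"-> overall {overall:.1f}X")
```

With an assumed 5% serial fraction, even a 100X speedup of the parallel part gives less than 17X overall, and the serial part then dominates the run time, which is where a fast CPU still matters.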
When I hear crap like "the only interesting workloads are amenable to GPUs", it's quite annoying. Ditto for claimed 100X speed ups.
The programming models are also a huge open question, but beyond the scope of this post.
David