Article: Parallelism at HotPar 2010
By: anon (anon.delete@this.anon.com), August 2, 2010 8:36 pm
Room: Moderated Discussions
Michael S (already5chosen@yahoo.com) on 8/2/10 wrote:
---------------------------
>Mark Roulo (nothanks@xxx.com) on 8/2/10 wrote:
>---------------------------
>>AM (myname4rwt@jee-male.com) on 8/2/10 wrote:
>>---------------------------
>>>David Kanter (dkanter@realworldtech.com) on 7/29/10 wrote:
>>>---------------------------
>>>...
>>>>When I hear crap like "the only interesting workloads are amenable to GPUs", it's
>>>>quite annoying. Ditto for claimed 100X speed ups.
>>
>>
>>>And since you are apparently calling crap all 100x and higher speedups, it's reasonable
>>>to ask if you have any proof wrt every piece of published research with such results.
>>>I don't think you have any though.
>>
>>The raw compute advantage of an nVidia Fermi GPU vs. a 6-core Intel CPU is in the 10x range.
>>
>>The raw bandwidth advantage is in the 4x to 5x range.
>>
>>The GPU is less flexible.
>>
>>I would suggest that when a paper claims a 10x to 20x speed increase above the
>>raw hardware advantage, it is up to the paper authors to explain where this advantage came from.
>>
>>I've been doing GPU programming for about one year now, and
>>having looked at a number of these papers (and wandered around the nVidia GPGPU
>>conference one year), I can say that the vast majority of the papers claiming more
>>than 4-5x performance do it using one or more of the following techniques:
>>
>>1) Use only one CPU core,
>>2) Use scalar CPU code,
>>3) Fail to strip-mine the CPU code for good cache locality,
>>4) Use a better algorithm on the GPU (e.g. N^3 for the CPU, N log(N) for the GPU).
>>
>>I have seen examples of all four, with (1)-(3) being the most common (and these three often appear together).
>>
>>My back-of-the-envelope guesstimate is that going from 1 core to 6 on the CPU is
>>worth 4x to 6x (we can use 5x as a nice middle ground), using the vector unit is
>>4x, and strip-mining can get you 2x or more. Put them together and you get 5x4x2
>>= 40x speedup of well optimized CPU code versus simple scalar code. If the GPU
>>is being compared to a CPU running simple scalar code, you might see a 100x to 200x
>>claim, but this will turn into a 2½x to 5x claim if run against well optimized CPU code.
>>
>>It is clearly unreasonable to expect David (or anyone else) to read every paper
>>claiming unrealistic speedups, but the hardware just isn't there for the GPU to
>>see more than about 10x. When the underlying hardware can't do something that it
>>is claimed to do, I think the burden of proof properly belongs on the people making the claim.
>>
>>-Mark Roulo
>>
>
>comp.arch discussion:
>http://groups.google.com/group/comp.arch/browse_frm/thread/5aadb32f1b30a7e4/5424f4f8d89e16e9#5424f4f8d89e16e9
>
>In particular this short message by Terje Mathisen
>http://groups.google.com/group/comp.arch/msg/d36a9e35dc31c73c
>>
>
>The link he provided:
>http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_Seismic_Hess.pdf
>
>
>Unfortunately, the article contains about zero information about the reference "CPU" implementation
>over which they claim a major gain. However, they sound honest; the article doesn't leave
>the impression of an NVidia-sponsored cheat.
Well, it is somewhat NVidia-sponsored. The fact that they don't provide anything about the CPUs used, or any details of the code, raises red flags here.
The "CPU" implementation is apparently their production code, so it should be quite well optimized. However, the production CUDA implementation appears to be quite different (several kernels doing different steps); it's not clear whether the results are equivalent, or whether similar optimizations would also improve the CPU performance.
Also, the comparisons they are making appear to be against their old CPU cluster. Seeing as they bought their first GPU cluster in January 2008 and do upgrades about once per year, the comparison would probably be against early-2007-era CPUs, quite likely dual-core Opterons.
In summary, it looks like more GPU waffle. If they really had a 10x-80x performance speedup against an equivalent implementation on contemporary CPUs, they would be absolutely stressing those points.
I have no problem imagining there is a reasonable gain over a similarly optimized implementation on current CPUs, but I could just as easily imagine it is far below 10x in common cases in practice.
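
To put some very rough flesh on Mark's points (1)-(3) above, here is a minimal sketch of the kind of baseline gap he describes. The kernel (a plain single-precision matrix multiply), the block size and the OpenMP pragmas are my own illustrative assumptions, not code from the Hess paper or from any of the benchmarks being discussed; the point is only to show how far a "simple scalar" baseline sits from a threaded, vectorized, cache-blocked version of the same computation.

/* Illustrative only: a hypothetical single-precision matrix multiply,
   not the code from any of the papers under discussion. */
#include <stddef.h>

/* Weak baseline (points 1 and 2): one core, plain scalar code. */
void matmul_naive(size_t n, const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}

/* Points (1)-(3) addressed: all cores via OpenMP, strip-mining/blocking so
   tiles of A, B and C stay in cache, and an inner loop the compiler can
   vectorize. C must be zeroed before the call; BLK = 64 is a guess, tune
   it per CPU. Each thread owns distinct (ii, jj) tiles of C, so there is
   no write race. */
#define BLK 64
void matmul_opt(size_t n, const float *A, const float *B, float *C)
{
    #pragma omp parallel for collapse(2)
    for (size_t ii = 0; ii < n; ii += BLK)
        for (size_t jj = 0; jj < n; jj += BLK)
            for (size_t kk = 0; kk < n; kk += BLK)
                for (size_t i = ii; i < ii + BLK && i < n; i++)
                    for (size_t k = kk; k < kk + BLK && k < n; k++) {
                        float a = A[i*n + k];
                        #pragma omp simd
                        for (size_t j = jj; j < jj + BLK && j < n; j++)
                            C[i*n + j] += a * B[k*n + j];
                    }
}

Built with something like gcc -O3 -fopenmp -march=native, the second version is roughly the CPU code a GPU kernel ought to be measured against; comparing against the first is where a lot of the 100x headlines come from.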