Article: Parallelism at HotPar 2010
By: Michael S (already5chosen.delete@this.yahoo.com), August 3, 2010 12:41 am
Room: Moderated Discussions
anon (anon@anon.com) on 8/2/10 wrote:
---------------------------
>Michael S (already5chosen@yahoo.com) on 8/2/10 wrote:
>---------------------------
>>Mark Roulo (nothanks@xxx.com) on 8/2/10 wrote:
>>---------------------------
>>>AM (myname4rwt@jee-male.com) on 8/2/10 wrote:
>>>---------------------------
>>>>David Kanter (dkanter@realworldtech.com) on 7/29/10 wrote:
>>>>---------------------------
>>>>...
>>>>>When I hear crap like "the only interesting workloads are amenable to GPUs", it's
>>>>>quite annoying. Ditto for claimed 100X speed ups.
>>>
>>>
>>>>And since you are apparently calling crap all 100x and higher speedups, it's reasonable
>>>>to ask if you have any proof wrt every piece of published research with such results.
>>>>I don't think you have any though.
>>>
>>>The raw compute advantage of an nVidia Fermi GPU vs. a 6-core Intel CPU is in the 10x range.
>>>
>>>The raw bandwidth advantage is in the 4x to 5x range.
>>>
>>>The GPU is less flexible.
>>>
>>>I would suggest that when a paper claims a 10x to 20x speed increase above the
>>>raw hardware advantage, it is up to the paper authors to explain where this advantage came from.
>>>
>>>I've been doing GPU programming for about one year now, and
>>>having looked at a number of these papers (and wandered around the nVidia GPGPU
>>>conference one year), I can say that the vast majority of the papers claiming more
>>>than 4-5x performance do it using one or more of the following techniques:
>>>
>>>1) Use only one CPU core,
>>>2) Use scalar CPU code,
>>>3) Fail to strip-mine the CPU code for good cache locality,
>>>4) Use a better algorithm on the GPU (e.g. O(N^3) for the CPU, O(N log N) for the GPU).
>>>
>>>I have seen examples of all four, with (1)-(3) being the most common (and these three often appear together).
>>>
>>>My back-of-the-envelope guesstimate is that going from 1 core to 6 on the CPU is
>>>worth 4x to 6x (we can use 5x as a nice middle ground), using the vector unit is
>>>4x, and strip-mining can get you 2x or more. Put them together and you get 5x * 4x * 2x
>>>= 40x speedup of well-optimized CPU code versus simple scalar code. If the GPU
>>>is being compared to a CPU running simple scalar code, you might see a 100x to 200x
>>>claim, but this will turn into a 2½x to 5x claim if run against well optimized CPU code.
>>>
>>>It is clearly unreasonable to expect David (or anyone else) to read every paper
>>>claiming unrealistic speedups, but the hardware just isn't there for the GPU to
>>>see more than about 10x. When the underlying hardware can't do something that it
>>>is claimed to do, I think the burden of proof properly belongs on the people making the claim.
>>>
>>>-Mark Roulo
>>>
>>
>>comp.arch discussion:
>>http://groups.google.com/group/comp.arch/browse_frm/thread/5aadb32f1b30a7e4/5424f4f8d89e16e9#5424f4f8d89e16e9
>>
>>In particular, this short message by Terje Mathisen:
>>http://groups.google.com/group/comp.arch/msg/d36a9e35dc31c73c
>>
>>
>>The link he provided:
>>http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_Seismic_Hess.pdf
>>
>>
>>Unfortunately, the article contains about zero information about the reference "CPU" implementation
>>over which they claim a major gain. However, they sound honest; the article doesn't leave
>>the impression of an NVidia-sponsored cheat.
>
>Well, it is somewhat NVidia-sponsored. The fact that they don't provide anything
>about the CPUs used or the details of the code raises red flags here.
>
>The "CPU" implementation is apparently their production code, so it should be quite
>well optimized. However the production CUDA kernel appears to be quite different
>(several kernels doing different steps), it's not clear if results are equivalent
>or if similar optimizations would improve the CPU performance.
>
>Also, the comparisons they are making appear to be against their old CPU cluster.
>Seeing as they bought their first GPU cluster in Jan 2008 and do upgrades about once
>per year, the comparison would probably be against early-2007-era CPUs, quite likely dual-core Opterons.
>
>In summary, it looks like more GPU waffle. If they really had a 10x-80x performance
>speedup against an equivalent implementation on contemporary CPUs, they would absolutely be stressing those points.
>
>I have no problem imagining there is a reasonable gain over a similarly optimized
>implementation on current CPUs, but I could easily imagine it is far below 10x in common cases in practice.
>
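By the way, for anybody who hasn't met the term: the strip-mining in Mark's point (3) just means chopping the data set into cache-sized strips so that successive passes reuse what is already sitting in cache instead of streaming the whole array from DRAM every time. A toy sketch (sizes invented, nothing to do with anyone's production code):

#define N      (1 << 24)   /* 16M floats, much bigger than any cache      */
#define STRIP  (1 << 15)   /* 32K floats = 128 KB, comfortably fits in L2 */

void two_passes_naive(float *a)
{
    int i;
    for (i = 0; i < N; i++) a[i] = a[i] * 1.1f + 2.0f;  /* pass 1 streams all of a[] from DRAM */
    for (i = 0; i < N; i++) a[i] = a[i] * a[i];          /* pass 2 streams it all over again    */
}

void two_passes_strip_mined(float *a)
{
    int base, i;
    for (base = 0; base < N; base += STRIP) {            /* one cache-sized strip at a time */
        for (i = base; i < base + STRIP; i++) a[i] = a[i] * 1.1f + 2.0f;
        for (i = base; i < base + STRIP; i++) a[i] = a[i] * a[i];   /* data is still hot in L2 */
    }
}

Same arithmetic in both versions, but the second one touches DRAM roughly half as often, which is where a 2x of the kind Mark estimates can come from; both inner loops also vectorize trivially.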
Well, there is at least one case in which I'd easily believe a 100x gain - the case where your algorithm uses texture interpolation.
According to my understanding of the HESS article, they used the texture units for efficient caching, but not for interpolation.
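To illustrate what I mean by the interpolation case: with the filter mode set to linear, a single tex2D() fetch returns an already bilinearly interpolated value - the four neighbouring loads plus the weighting arithmetic all happen in fixed-function hardware. A minimal sketch (names and setup invented, not taken from the HESS code), assuming the field has been copied into a cudaArray and bound to the texture with filterMode = cudaFilterModeLinear:

texture<float, 2, cudaReadModeElementType> fieldTex;  /* bound via cudaBindTextureToArray() elsewhere */

__global__ void sampleField(const float2 *pos, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float2 p = pos[i];
    /* one fetch = four neighbouring texels loaded and bilinearly weighted in hardware */
    out[i] = tex2D(fieldTex, p.x + 0.5f, p.y + 0.5f);
}

A CPU doing the same sample pays for the four loads, the address arithmetic and the lerp in ordinary instructions, so if interpolation dominates the inner loop, a genuine two-orders-of-magnitude gap doesn't look crazy to me.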