Article: Parallelism at HotPar 2010
By: Michael S (already5chosen.delete@this.yahoo.com), August 2, 2010 3:31 pm
Room: Moderated Discussions
Mark Roulo (nothanks@xxx.com) on 8/2/10 wrote:
---------------------------
>AM (myname4rwt@jee-male.com) on 8/2/10 wrote:
>---------------------------
>>David Kanter (dkanter@realworldtech.com) on 7/29/10 wrote:
>>---------------------------
>>...
>>>When I hear crap like "the only interesting workloads are amenable to GPUs", it's
>>>quite annoying. Ditto for claimed 100X speed ups.
>
>
>>And since you are apparently calling crap all 100x and higher speedups, it's reasonable
>>to ask if you have any proof wrt every piece of published research with such results.
>>I don't think you have any though.
>
>The raw compute advantage of an nVidia Fermi GPU vs. a 6-core Intel CPU is in the 10x range.
>
>The raw bandwidth advantage is in the 4x to 5x range.
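
(Aside from me: those two ratios are easy to sanity-check with peak-rate arithmetic. The little C sketch below uses a GTX 480 and a Core i7-980X as stand-in parts; the choice of parts and the rounded numbers are my own assumptions, not Mark's.)

#include <stdio.h>

int main(void)
{
    /* GTX 480 (Fermi): 480 CUDA cores at ~1.4 GHz, FMA counted as 2 flops;
       384-bit GDDR5 at ~3.7 GT/s.  Rounded vendor peak numbers. */
    double gpu_gflops = 480 * 1.4 * 2;       /* ~1344 GFLOPS single precision */
    double gpu_gbs    = 384 / 8 * 3.7;       /* ~178 GB/s */

    /* Core i7-980X (6-core Westmere): 3.33 GHz, SSE = 4-wide add + 4-wide mul
       per clock per core; triple-channel DDR3-1333. */
    double cpu_gflops = 6 * 3.33 * 8;        /* ~160 GFLOPS single precision */
    double cpu_gbs    = 3 * 10.6;            /* ~32 GB/s */

    printf("peak compute ratio:   ~%.1fx\n", gpu_gflops / cpu_gflops);
    printf("peak bandwidth ratio: ~%.1fx\n", gpu_gbs / cpu_gbs);
    return 0;
}

It prints roughly 8x for compute and 5-6x for bandwidth, i.e. the same ballpark as the ranges Mark quotes.
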
>
>The GPU is less flexible.
>
>I would suggest that when a paper claims a 10x to 20x speed increase above the
>raw hardware advantage, it is up to the paper authors to explain where this advantage came from.
>
>I've been doing GPU programming for about one year now, and
>having looked at a number of these papers (and wandered around the nVidia GPGPU
>conference one year), I can say that the vast majority of the papers claiming more
>than 4-5x performance do it using one or more of the following techniques:
>
>1) Use only one CPU core,
>2) Use scalar CPU code,
>3) Fail to strip-mine the CPU code for good cache locality,
>4) Use a better algorithm on the GPU (e.g. N^3 for the CPU, N log(N) for the GPU).
>
>I have seen examples of all four, with (1)-(3) being the most common (and these three often appear together).
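
(To make (2) and (3) concrete, here is a minimal C sketch of mine: the same chain of element-wise passes written first as the naive scalar baseline that sweeps the whole array once per pass, then strip-mined so a cache-sized strip stays hot across all passes. The kernel, names and strip size are illustrative assumptions, not code from any of the papers being discussed.)

#include <stddef.h>
#include <stdlib.h>
#include <stdio.h>

enum { STRIP = 8192 };  /* ~32 KB of floats; pick to fit in L1/L2 (assumption) */

/* The baseline seen in many comparisons: one core, scalar, one full sweep
   per pass, so passes 2 and 3 re-read everything from DRAM. */
static void update_naive(float *a, size_t n)
{
    for (size_t i = 0; i < n; ++i) a[i] = a[i] * 1.1f + 2.0f;  /* pass 1 */
    for (size_t i = 0; i < n; ++i) a[i] = a[i] * a[i];         /* pass 2 */
    for (size_t i = 0; i < n; ++i) a[i] = a[i] - 3.0f;         /* pass 3 */
}

/* Strip-mined: run all three passes over one cache-resident strip before
   moving on, so passes 2 and 3 hit cache instead of DRAM. The strip loop
   is also the natural place for threads and SIMD (e.g. OpenMP + autovec). */
static void update_stripmined(float *a, size_t n)
{
    for (size_t s = 0; s < n; s += STRIP) {
        size_t end = (s + STRIP < n) ? s + STRIP : n;
        for (size_t i = s; i < end; ++i) a[i] = a[i] * 1.1f + 2.0f;
        for (size_t i = s; i < end; ++i) a[i] = a[i] * a[i];
        for (size_t i = s; i < end; ++i) a[i] = a[i] - 3.0f;
    }
}

int main(void)
{
    size_t n = 1 << 26;                 /* 64M floats, far bigger than any cache */
    float *a = malloc(n * sizeof *a);
    float *b = malloc(n * sizeof *b);
    if (!a || !b) return 1;
    for (size_t i = 0; i < n; ++i) a[i] = b[i] = (float)(i & 1023);

    update_naive(a, n);                 /* time these two with your favorite timer; */
    update_stripmined(b, n);            /* same results, far fewer trips to DRAM    */

    printf("spot check: %f %f\n", a[12345], b[12345]);
    free(a); free(b);
    return 0;
}

The strip-mined version also threads and vectorizes trivially, which is exactly the kind of baseline tuning that points (1)-(3) leave out.
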
>
>My back-of-the-envelope guesstimate is that going from 1 core to 6 on the CPU is
>worth 4x to 6x (we can use 5x as a nice middle ground), using the vector unit is
>4x, and strip-mining can get you 2x or more. Put them together and you get 5x4x2
>= 40x speedup of well optimized CPU code versus simple scalar code. If the GPU
>is being compared to a CPU running simple scalar code, you might see a 100x to 200x
>claim, but this will turn into a 2½x to 5x claim if run against well optimized CPU code.
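
(The arithmetic in that paragraph, written out as a trivial program; the factors are Mark's guesstimates, not measurements of mine.)

#include <stdio.h>

int main(void)
{
    double cores    = 5.0;   /* 1 core -> 6 cores, middle of the 4x-6x guess */
    double simd     = 4.0;   /* scalar -> vectorized */
    double blocking = 2.0;   /* strip-mining for cache locality */
    double handicap = cores * simd * blocking;   /* 5 * 4 * 2 = 40x */

    double claimed[] = { 100.0, 200.0 };
    for (int i = 0; i < 2; ++i)
        printf("claimed %.0fx vs. scalar 1-core code -> ~%.1fx vs. tuned CPU code\n",
               claimed[i], claimed[i] / handicap);
    return 0;
}

It prints ~2.5x for a 100x claim and ~5.0x for a 200x claim, i.e. the 2.5x to 5x range above.
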
>
>It is clearly unreasonable to expect David (or anyone else) to read every paper
>claiming unrealistic speedups, but the hardware just isn't there for the GPU to
>see more than about 10x. When the underlying hardware can't do something that it
>is claimed to do, I think the burden of proof properly belongs on the people making the claim.
>
>-Mark Roulo
>
comp.arch discussion:
http://groups.google.com/group/comp.arch/browse_frm/thread/5aadb32f1b30a7e4/5424f4f8d89e16e9#5424f4f8d89e16e9
In particular, this short message by Terje Mathisen:
http://groups.google.com/group/comp.arch/msg/d36a9e35dc31c73c
The link he provided:
http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_Seismic_Hess.pdf
Unfortunately, the article contains almost no information about the reference "CPU" implementation over which they claim a major gain. However, they sound honest; the article doesn't leave the impression of an NVidia-sponsored cheat.