Article: Parallelism at HotPar 2010
By: Mark Roulo (nothanks.delete@this.xxx.com), August 2, 2010 9:41 am
Room: Moderated Discussions
AM (myname4rwt@jee-male.com) on 8/2/10 wrote:
---------------------------
>David Kanter (dkanter@realworldtech.com) on 7/29/10 wrote:
>---------------------------
>...
>>When I hear crap like "the only interesting workloads are amenable to GPUs", it's
>>quite annoying. Ditto for claimed 100X speed ups.
>And since you are apparently calling crap all 100x and higher speedups, it's reasonable
>to ask if you have any proof wrt every piece of published research with such results.
>I don't think you have any though.
The raw compute advantage of an nVidia Fermi GPU vs. a 6-core Intel CPU is in the 10x range.
The raw bandwidth advantage is in the 4x to 5x range.
The GPU is less flexible.
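For what it's worth, here is the back-of-the-envelope arithmetic behind those ratios. The specific parts and specs (a GTX 480-class Fermi against a Core i7-980X-class 6-core Westmere) are my own assumptions for illustration:

/* Rough peak-throughput comparison. The parts and specs below
 * (GTX 480-class Fermi vs. Core i7-980X-class Westmere) are assumed
 * for illustration, not taken from any particular paper. */
#include <stdio.h>

int main(void)
{
    /* GTX 480: 480 CUDA cores x 1.401 GHz shader clock x 2 flops/clock (FMA) */
    double gpu_gflops = 480 * 1.401 * 2;   /* ~1345 GFLOPS single precision */
    double gpu_gbs    = 177.4;             /* 384-bit GDDR5 memory bus      */

    /* i7-980X: 6 cores x 3.33 GHz x 8 SP flops/cycle (4-wide SSE add + mul) */
    double cpu_gflops = 6 * 3.33 * 8;      /* ~160 GFLOPS single precision  */
    double cpu_gbs    = 32.0;              /* triple-channel DDR3-1333      */

    printf("compute ratio:   %.1fx\n", gpu_gflops / cpu_gflops);  /* ~8.4x */
    printf("bandwidth ratio: %.1fx\n", gpu_gbs / cpu_gbs);        /* ~5.5x */
    return 0;
}

Both come out in the same ballpark as the 10x and 4x-5x figures above.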
I would suggest that when a paper claims a speedup 10x to 20x beyond that raw hardware advantage, it is up to the authors to explain where the extra factor came from.
I've been doing GPU programming for about a year now, and having looked at a number of these papers (and wandered around the nVidia GPGPU conference one year), I can say that the vast majority of papers claiming more than a 4-5x speedup do so using one or more of the following techniques:
1) Use only one CPU core,
2) Use scalar CPU code,
3) Fail to strip-mine the CPU code for good cache locality,
4) Use a better algorithm on the GPU (e.g. N^3 for the CPU, N log(N) for the GPU).
I have seen examples of all four, with (1)-(3) being the most common (and these three often appear together); a rough code sketch of what (1)-(3) look like follows below.
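The sketch is purely illustrative, not code from any of the papers discussed. The first routine is the one-core, scalar, cache-unfriendly baseline that many comparisons use; the second is the same computation multi-threaded with OpenMP and strip-mined (blocked) so the compiler can also vectorize the inner loop. Compile with something like gcc -O3 -fopenmp.

/* Minimal sketch of baselines (1)-(3); illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

#define N 1024
#define B 64                 /* strip (block) size, sized to stay in cache */

/* (1)+(2)+(3): one core, scalar code, cache-hostile k/j access pattern. */
static void mm_naive(const float *a, const float *b, float *c)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += a[i*N + k] * b[k*N + j];
            c[i*N + j] = sum;
        }
}

/* Fairer CPU baseline: all cores via OpenMP, loops strip-mined into BxB
 * blocks for locality, inner loop left in a form the compiler can
 * auto-vectorize with SSE/AVX. Caller must zero c first. */
static void mm_blocked(const float *a, const float *b, float *c)
{
    #pragma omp parallel for collapse(2)
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++) {
                        float aik = a[i*N + k];
                        for (int j = jj; j < jj + B; j++)
                            c[i*N + j] += aik * b[k*N + j];
                    }
}

int main(void)
{
    float *a = malloc(N * N * sizeof *a);
    float *b = malloc(N * N * sizeof *b);
    float *c = malloc(N * N * sizeof *c);
    for (int i = 0; i < N * N; i++) { a[i] = 1.0f; b[i] = 0.5f; }

    double t0 = omp_get_wtime();
    mm_naive(a, b, c);
    double t1 = omp_get_wtime();
    memset(c, 0, N * N * sizeof *c);      /* blocked version accumulates */
    mm_blocked(a, b, c);
    double t2 = omp_get_wtime();

    printf("naive:   %.2fs\nblocked: %.2fs\n", t1 - t0, t2 - t1);
    free(a); free(b); free(c);
    return 0;
}

The gap between these two routines is exactly the kind of factor estimated in the next paragraph.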
My back-of-the-envelope guesstimate is that going from 1 core to 6 on the CPU is worth 4x to 6x (call it 5x as a nice middle ground), using the vector unit is worth 4x, and strip-mining can get you 2x or more. Put them together and you get 5 x 4 x 2 = 40x for well-optimized CPU code versus simple scalar code. If the GPU is being compared to a CPU running simple scalar code, you might see a 100x to 200x claim, but this will turn into a 2.5x to 5x claim if run against well-optimized CPU code.
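In other words, the sanity check is a single division (the 150x headline number below is made up purely for illustration):

/* Normalize a claimed "GPU vs. single-core scalar CPU" speedup by the
 * factor a well-optimized CPU baseline would recover. The 150x headline
 * below is a made-up example. */
#include <stdio.h>

int main(void)
{
    double cores = 5.0, simd = 4.0, strip_mine = 2.0;
    double cpu_opt = cores * simd * strip_mine;      /* = 40x */
    double claimed = 150.0;                          /* hypothetical claim */

    printf("vs. well-optimized CPU code: ~%.1fx\n", claimed / cpu_opt); /* ~3.8x */
    return 0;
}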
It is clearly unreasonable to expect David (or anyone else) to read every paper claiming unrealistic speedups, but the hardware just isn't there for the GPU to see more than about 10x. When the underlying hardware can't deliver what is being claimed, I think the burden of proof properly belongs on the people making the claim.
-Mark Roulo