Article: Parallelism at HotPar 2010
By: anon (anon.delete@this.anon.com), August 4, 2010 12:12 am
Room: Moderated Discussions
AM (myname4rwt@jee-male.com) on 8/4/10 wrote:
---------------------------
>David Kanter (dkanter@realworldtech.com) on 8/3/10 wrote:
>---------------------------
>>AM (myname4rwt@jee-male.com) on 8/3/10 wrote:
>>---------------------------
>>>Mark Roulo (nothanks@xxx.com) on 8/2/10 wrote:
>>>---------------------------
>>>>AM (myname4rwt@jee-male.com) on 8/2/10 wrote:
>>>>---------------------------
>>>>>David Kanter (dkanter@realworldtech.com) on 7/29/10 wrote:
>>>>>---------------------------
>>>>>...
>>>>>>When I hear crap like "the only interesting workloads are amenable to GPUs", it's
>>>>>>quite annoying. Ditto for claimed 100X speed ups.
>>>>
>>>>
>>>>>And since you are apparently calling crap all 100x and higher speedups, it's reasonable
>>>>>to ask if you have any proof wrt every piece of published research with such results.
>>>>>I don't think you have any though.
>>>>
>>>>The raw compute advantage of an nVidia Fermi GPU vs. a 6-core Intel CPU is in the 10x range.
>>>>
>>>>The raw bandwidth advantage is in the 4x to 5x range.
>>>>
>>>>The GPU is less flexible.
>>>>
>>>>I would suggest that when a paper claims a 10x to 20x speed increase above the
>>>>raw hardware advantage, it is up to the paper authors to explain where this advantage came from.
>>>>
>>>>I've been doing GPU programming for about one year now, and
>>>>having looked at a number of these papers (and wandered around the nVidia GPGPU
>>>>conference one year), I can say that the vast majority of the papers claiming more
>>>>than 4-5x performance do it using one or more of the following techniques:
>>>>
>>>>1) Use only one CPU core,
>>>>2) Use scalar CPU code,
>>>>3) Fail to strip-mine the CPU code for good cache locality,
>>>>4) Use a better algorithm on the GPU (e.g. N^3 for the CPU, N log(N) for the GPU).
>>>>
>>>>I have seen examples of all four, with (1)-(3) being the most common (and these three often appear together).
>>>>
>>>>My back-of-the-envelope guesstimate is that going from 1 core to 6 on the CPU is
>>>>worth 4x to 6x (we can use 5x as a nice middle ground), using the vector unit is
>>>>4x, and strip-mining can get you 2x or more. Put them together and you get 5 x 4 x 2
>>>>= 40x speedup of well-optimized CPU code versus simple scalar code. If the GPU
>>>>is being compared to a CPU running simple scalar code, you might see a 100x to 200x
>>>>claim, but this will turn into a 2½x to 5x claim if run against well optimized CPU code.
>>>>
>>>>It is clearly unreasonable to expect David (or anyone else) to read every paper
>>>>claiming unrealistic speedups, but the hardware just isn't there for the GPU to
>>>>see more than about 10x. When the underlying hardware can't do something that it
>>>>is claimed to do, I think the burden of proof properly belongs on the people making the claim.
>>>>
>>>>-Mark Roulo
>>>
>>>Here is a very simple reality check for you (and David): get a machine with win7
>>>on and check how fast warp can render, say, Crysis (use the benchmark tool). GTX
>>>460 (available from $200 these days) cranks out over 30 fps in 1680x1050, VHD and
>>>over 60 fps (GASP) in SLI, same mode. And very short of >30/60 fps in 1920x1080, VHD from the report I saw.
>>
>>Wow, that tells us so much about GPGPU! It's a really good thing that rendering
>>the output of a game LOOKS EXACTLY like real work, such as computational fluid dynamics.
>>
>>Seriously, what planet do you live on?
>>
>>Perhaps it would behoove you to read what mark wrote. He's been writing GPGPU
>>code for a year. I would expect that he understands the situation quite well.
>>
>>>How fast do you think CPU can handle this task (btw, a representative of a very
>>>widespread class of workloads), even the Intel's 6 core you mentioned? And how many
>>>Intel's 6-core crown jewels selling for $1k+ a pop will it take to get the same performance?
>>
>>Representative of some games perhaps. We are talking about GPGPU.
>>
>>>Have a nice time reevaluating your claims (or better yet, running the test and reporting the results).
>>>
>>>PS Reportedly, Warp provides good scalability with core count and makes good use
>>>even of SSE 4.1, so I suggest you should pull some evidence before you start
>>>talking about poorly-written code here.
>>
>>Who cares about rendering, we are talking about GPGPU...
>
>BS (nice try though). Read his post again (bw and FP advantage,
You think that is BS? Then what are your numbers for the BW and FP advantages of GPUs?
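For reference, here is the kind of back-of-the-envelope number I have in mind, using published peak specs from memory (GTX 480 vs. Core i7-980X), so treat the exact figures as approximate:

/* Rough peak-spec comparison, GTX 480 (Fermi) vs. Core i7-980X (6-core
   Westmere). Spec numbers are from memory and only approximate. */
#include <stdio.h>

int main(void)
{
    /* GPU: 480 CUDA cores * 1.40 GHz * 2 flops/cycle (FMA), ~177 GB/s GDDR5. */
    double gpu_gflops = 480 * 1.40 * 2;   /* ~1344 GFLOP/s single precision */
    double gpu_bw     = 177.0;            /* GB/s */

    /* CPU: 6 cores * 3.33 GHz * 8 SP flops/cycle (4-wide SSE add + mul),
       ~32 GB/s from triple-channel DDR3. */
    double cpu_gflops = 6 * 3.33 * 8;     /* ~160 GFLOP/s single precision */
    double cpu_bw     = 32.0;             /* GB/s */

    printf("FP ratio: %.1fx, BW ratio: %.1fx\n",
           gpu_gflops / cpu_gflops, gpu_bw / cpu_bw);   /* ~8.4x and ~5.5x */
    return 0;
}

Which lands about where Mark put it: roughly an order of magnitude on raw FLOPS and about 5x on bandwidth, before anyone derates either side for efficiency.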
> poor CPU codes,
This has been demonstrated to be the case on occasion.
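To make the "poor CPU code" point concrete, here is a toy sketch of Mark's item 3 (my own example, not taken from any paper): the same out-of-place transpose, naive versus strip-mined into cache-sized tiles. The tile size of 64 is just a guess at something cache-friendly.

/* Toy illustration of strip-mining (cache blocking). On large matrices
   the naive loop thrashes the cache on its strided stores; the blocked
   loop keeps each tile resident. A GPU measured against the naive
   version picks up a free multiplier that says nothing about the GPU. */
#define TILE 64

void transpose_naive(int n, const float *a, float *t)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            t[j * n + i] = a[i * n + j];          /* stride-n stores */
}

void transpose_blocked(int n, const float *a, float *t)
{
    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int i = ii; i < ii + TILE && i < n; i++)
                for (int j = jj; j < jj + TILE && j < n; j++)
                    t[j * n + i] = a[i * n + j];  /* tile stays in cache */
}

That is the sort of 2x-or-more Mark attributes to strip-mining alone.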
>and the conclusion that the real gap (when compared against well-optimized CPU codes)
>is 2.5x-5x).
I think that is a guess, but it should be much closer to the mark than 100x. Anything over 10x would be very interesting, and if the comparison is fair, it is likely relying on some specialized function the GPU provides in hardware.
> The guy is full of it. And so are you, unless you can *really* back
>your 100x-claims-are-crap statements by showing that every published work that reported
>such speedups is not worth considering for being misleading, wrong etc.
The statement is that 100x is probably crap when the comparison is against a contemporary CPU running result-equivalent, equally well-optimized, parallel MIMD+SIMD code.
Which papers are claiming 100x? Are they also stating which CPUs were compared against, and that they were comparing equivalent code and using all the parallelism in the CPU? Please point out those papers.
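And to be clear about what "equally well-optimized, parallel MIMD+SIMD code" means on the CPU side, something along these lines is the baseline a fair paper would have to show (a sketch of a SAXPY-style kernel using OpenMP plus SSE; the aligned pointers and multiple-of-4 length are assumptions to keep it short):

/* Sketch of a CPU baseline that actually uses the hardware: all cores
   via OpenMP (MIMD) plus 4-wide SSE (SIMD). Assumes n is a multiple of 4
   and x, y are 16-byte aligned. Compile with -fopenmp or equivalent. */
#include <xmmintrin.h>

void saxpy(int n, float a, const float *x, float *y)
{
    __m128 va = _mm_set1_ps(a);
    #pragma omp parallel for
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_load_ps(x + i);            /* load 4 floats of x */
        __m128 vy = _mm_load_ps(y + i);            /* load 4 floats of y */
        _mm_store_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));  /* y = a*x + y */
    }
}

Benchmark a CUDA kernel against the single-threaded scalar version of that loop instead, and you have manufactured most of a "GPU speedup" before the GPU even starts.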
>>Now repeat after me: GPGPU != games, GPGPU != games, GPGPU != games.
>>
>>David
>
>Well, believe it or not, but *the* market for GPUs is
Stop going off topic.