Article: Parallelism at HotPar 2010
By: AM (myname4rwt@jee-male.com), August 19, 2010 1:29 am
Room: Moderated Discussions
Michael S (already5chosen@yahoo.com) on 8/18/10 wrote:
---------------------------
>AM (myname4rwt@jee-male.com) on 8/18/10 wrote:
>---------------------------
>>Michael S (already5chosen@yahoo.com) on 8/17/10 wrote:
>>---------------------------
>>>AM (myname4rwt@jee-male.com) on 8/17/10 wrote:
>>>---------------------------
>>>>anon (anon@anon.com) on 8/16/10 wrote:
>>>>---------------------------
>>>>>AM (myname4rwt@jee-male.com) on 8/16/10 wrote:
>>>>>---------------------------
>>>>>>anon (anon@anon.com) on 8/14/10 wrote:
>>>>>>---------------------------
>>>>>>>So in summary, nobody has been able to come up with a credible paper proving these
>>>>>>>fantastic 100x performance gains despite being absolutely certain the claims are real.
>>>>>>
>>>>>>In summary, claims that GPUs' perf advantage is limited to 2.5x-5x are complete
>>>>>>and utter BS,
>>>>>
>>>>>AM, the claim, actually, is coming from the 100x-1000x people. People here are doubting
>>>>>that claim because of the numbers involved, but it is not up to them to disprove
>>>>
>>>>As a matter of fact, some people here (to be precise, Mark Roulo) asserted that
>>>>the "real" perf advantage of GPUs over CPUs is 2.5x-5x. Hence my suggestion to him
>>>>(and David Kanter wrt his own statements) to show how it can possibly be true with
>>>>a selection of papers as small as 10. Haven't seen either of them in the thread since.
>>>
>>>I'd like to hear what your own claim is.
>>>Say, on the CPU side, an i5-760 (4 cores, 2.8 GHz, 8MB L3, no HT, official price=$205),
>>>multithreaded, SIMD-optimized (not by compiler).
>>
>>If we allow hand optimization for the CPU, it's obvious that the same should apply
>>to the GPU side. Or rely on the compiler in both cases.
>>
>
>Did Nvidia disclose their ISA and provide the tools required for hand optimization? Methinks not.
Not to the bare metal, I think, but decuda/cudasm by Wladimir Jasper van der Laan have been around for years. Besides, have you read about PTX? Check the ISA docs on Nvidia's site.
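To make it concrete, here's a minimal sketch of what tuning below C looks like on the Nvidia side (kernel name and code are made up for illustration, not anyone's benchmark). You can dump a kernel's virtual-ISA assembly with "nvcc -ptx", or embed hand-written PTX through asm(), which nvcc accepts even though it's only thinly documented:

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float r;
        // One hand-written PTX instruction: multiply-add in a single op
        // (mad.f32 maps onto the native multiply-add on G80/GT200).
        asm("mad.f32 %0, %1, %2, %3;"
            : "=f"(r)
            : "f"(a), "f"(x[i]), "f"(y[i]));
        y[i] = r;
    }
}

And decuda will disassemble the resulting cubin, so you can check what ptxas actually made of it at the native G80/GT200 level.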
>On the other hand, hand optimization for x86 SIMD is very well supported by plenty
>of tools (compiler intrinsics, debuggers, profilers). In short, hand-optimizing
>below standard C/Fortran on Nehalem is practical and done in practice by tens of
>thousands of devs. Hand-optimizing below CUDA on Nvidia is not practical and likely
>not even possible for non-Nvidia devs.
Not only possible, but actually done.
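For comparison, this is roughly what "below standard C/Fortran" means on the CPU side: a hand-vectorized saxpy with SSE intrinsics. A sketch only, assuming 16-byte-aligned data and n divisible by 4:

#include <xmmintrin.h>

void saxpy_sse(int n, float a, const float *x, float *y)
{
    __m128 va = _mm_set1_ps(a);            // broadcast a into all 4 lanes
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_load_ps(x + i);    // aligned load of 4 floats
        __m128 vy = _mm_load_ps(y + i);
        vy = _mm_add_ps(_mm_mul_ps(va, vx), vy);
        _mm_store_ps(y + i, vy);           // y[i..i+3] = a*x[i..i+3] + y[i..i+3]
    }
}

If that counts as fair game for the CPU, the asm()/decuda route above has to count for the GPU.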
>>>On the GPU side, the best Nvidia can offer under $250 (official price, not a random internet link).
>>
>>I wonder what you mean by the official price. AFAIK Nvidia doesn't have any kind
>>of publicly available price list; I think all they do is [sometimes] quote target
>>prices for complete products in PRs at the time of release, but needless to say, those change over time.
>>
>>Besides, LGA 1156 is a cost adder that should be taken into account as well (after
>>all, I agreed with the correction for the CPU price). It's $30-$40 as of today, cheap vs. cheap.
>
>Not sure about it. LGA 1366 is a serious cost adder. LGA 1156 is already close
>to parity with LGA 775 and the gap continues to shrink.
If $40 is close to parity, then we don't have much of a disagreement here.
Anyway, if you're thinking in earnest about what GPUs can offer for under $250, my picks are as follows.
The GTX 470, 275, 280 and 9800 GX2, each new, fit the bill nicely:
http://www.google.com/products?q=gtx+470&scoring=p&lnk=pruser&price1=200&price2=
http://www.google.com/products?q=gtx+275&scoring=p&lnk=pruser&price1=200&price2=
http://www.google.com/products?q=gtx+280&scoring=p&show=dd&sa=N&lnk=pruser&price1=190&price2=
http://www.google.com/products?q=9800+gx2&scoring=p&lnk=pruser&price1=180&price2=
From a brief look at the specs, each has a chance to be the fastest on certain code, particularly where code deeply tuned for the G80 or GT200 uarch is concerned.
From the AMD camp, the HD 5850 fits easily:
http://www.google.com/products?q=hd5850&scoring=p&lnk=pruser&price1=200&price2=
and there's a very interesting deal on an allegedly new HD 4870 X2, which is $29 over your price point, but it's too good to miss and fits within the LGA 1156 cost adder:
http://www.google.com/products?q=hd+4870+x2&scoring=p&lnk=pruser&price1=200&price2=
I'm likely to take a week, maybe two, off any day now, so have fun if you're going to get one or two of these. Recalling some of your work, with all due respect, I'd be curious to see how successful you might be with CPU codes for the short list of 10 papers + cudamcml (your price point seems to be very good for the CPU: 4 cores and a large L3).
And no, insisting on hand-tuned SIMD for the CPU while disallowing hand tuning for the GPU, along with any of the GPU's HW capabilities, is a ridiculous proposition.
>>>The calculation should not take advantage of the GPU's texture interpolation capabilities
>>>(using the texture cache is not only allowed but highly desirable) since, first, it's
>>>extremely rare in non-3D-rendering code and as such non-representative; second,
>>>when it happens, everybody, including David Kanter and Mark Roulo, would probably agree that a >50x speedup is possible.
>>>So what speedup do you expect under conditions like the above?
>>
>>Why are you saying interpolation is extremely rare? Whenever we numerically solve
>>a problem that simulates something by representing reality in discretized form,
>>we can't do without interpolation as long as we want our functions to be at least C0-continuous.
>>
>
>Do you realize that the texture unit does interpolation on low-precision fixed-point
>numbers? It's not good enough even for sonar/radar beamforming. Applicability to
>traditional HPC is extremely rare, except maybe the final visualization phase, but
>that's better handled by classic GPU APIs rather than GPGPU.
IIRC, 32-bit FP textures have been in Nvidia's GPUs since the series 6 hw.
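And the weight precision is a separate issue from the texel format anyway: the texels can be full FP32; it's the lerp weights that the texture unit quantizes (to 9-bit fixed point, per the CUDA programming guide). Where that matters, you fetch and blend yourself. A sketch with made-up names, using the texture-reference API:

texture<float, 1, cudaReadModeElementType> tbl;  // host side sets
                                                 // cudaFilterModeLinear

__global__ void lerp_both(float u, int n, const float *raw,
                          float *hw, float *sw)
{
    // Hardware path: one fetch, weight quantized by the texture unit.
    *hw = tex1D(tbl, u + 0.5f);  // +0.5f: unnormalized-coordinate convention

    // Software path: two loads and a full-precision FP32 lerp.
    int   i = (int)u;
    float f = u - (float)i;      // fractional weight, full FP32
    if (i + 1 < n)
        *sw = raw[i] + f * (raw[i + 1] - raw[i]);
}

So hardware interpolation is a free-performance option where 9-bit weights suffice, not a hard limit on what the GPU can compute.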
>>Besides, making good use of the texture-interpolation hw was specifically emphasized
>>in one of the papers Intel selected to "debunk" the "myth", but all they did was
>>take some other Monte Carlo code for the comparison instead.
>>
>>As for your questions re. my claim/expectation:
>>1) I joined when I saw the comment by Kanter, which seems not to be supported by any research/work whatsoever;
>>2) if we are to take a closer look at the i5-760 vs. a GPU, then we need to clear up possible
>>disagreements first anyway (hence my comments above);
>>3) so far the only article which has clearly been shown to be totally misleading is the one by Lee et al.:
>>
>>http://portal.acm.org/citation.cfm?id=1816021
>>
>>"Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU"