Article: Parallelism at HotPar 2010
By: AM (myname4rwt.delete@this.jee-male.com), August 13, 2010 3:28 am
Room: Moderated Discussions
Apparently, Mark Roulo and David Kanter have nothing to back their statements with. No wonder, as the claim that the GPU's perf advantage is limited to 2.5x-5x is as silly as Intel's infamous GPU-myth-debunking paper.
In case anyone has already forgotten or didn't care to read that piece, Intel states therein: "To the best of our knowledge, our performance numbers are at least on par and often better than the best published data."
Sounds like an almost fair bakeoff then, right? Well, at least as fair as it could be for the range of codes selected by Intel to present the myth-debunking case to the world.
They open up with "We reexamine a number of claims [9,19,21,32,42,45,47,53] that GPUs perform 10X to 1000X better than CPUs on a number of throughput kernels/applications. After tuning the code for BOTH CPU and GPU, we find the GPU only performs 2.5X better than CPU." (*)
and conclude with "Our results are far less than previous claims like the 50X difference in pricing European options using Monte Carlo method [9], the 114X difference in LBM [45], the 40X difference in FFT [21], the 50X difference in sparse matrix vector multiplication [47] and the 40X difference in histogram computation [53], etc."
Aha! So all it took to accelerate the original codes by factors of tens was mere tuning of the code, even if it was done only for the CPU. Many of us would have happily forgiven Intel for that, had they provided the tuned versions of the code on their website -- both in the traditions of science and to promote sales of their hardware?!
Had it really been that way (and with the codes made available), Intel could have scored nicely both with the general public and with the folks who optimize to the last oomph for a living.
Here is where the fun begins. Buried in the midst of their manuscript is the following: "For some of the kernels, we have used the best available implementation that already existed. Specifically, evaluations of SGEMM, SpMV, FFT and MC on GTX280 have been done using code from [1,8,2,34], respectively."
Nothing alerted me when I read the paper for the first time, as I took Intel's wording of "the best available implementation" at face value. IOW, at least as good as the codes used in the papers they chose to debunk the myth with. Or even better.
It was almost by accident that I got back to this paper for references and some figures and noticed what was going on. The SpMV code used for comparison relies on a new storage format (ELLPACK-R) and a new implementation, and the paper [8] by Bell and Garland that Intel used for the debunking is itself cited by the authors of [47] as existing SpMV code (see the SpMV sketch near the end of this post for what that kind of layout looks like). And the only thing the "Monte-Carlo Option Pricing" presentation by Podlozhnyuk and Harris [34] appears to have in common with [9] by Bennemann et al. is, well, "Monte Carlo"! Here is an excerpt from the latter, btw:
"Our implementation takes advantage of the texture memory provid-
ed by the GPU. Texture memory is accessible as a one, two or three-
dimensional lookup table, with interpolation between the nodes realized
in hardware. In Figure 3 the bilinear interpolation is depicted. Assuming
the texture map to be defined on an integer lattice, the expression
f (u, v) = (1 ? ?)(1 ? ?)f [floor(u), floor(v)] + ?(1 ? ?)f [ceil(u), floor(v)]
+(1 ? ?)?f [floor(u), ceil(v)] + ??f [ceil(u), ceil(v)]
interpolates for arbitrary real coordinates u, v, where ? and ? are the
fractional parts of the coordinates. This formula is implemented on the
hardware level within the texture addressing unit. What would corre-
spond to four separate memory fetches and several integer and floating
point operations on a normal architecture happens on the GPU at the
speed of a single memory access."
Apparently, Intel would rather "debunk" this code with some other Monte-Carlo simulation.
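Just to make the excerpt concrete, here is a minimal CUDA sketch of the contrast it describes: a manual four-fetch bilinear interpolation versus a single hardware-filtered texture fetch. This is purely illustrative, not Bennemann et al.'s code; the names (texLUT, bilinearManual, lookupKernel) are mine, and it uses the texture-reference API of that CUDA era.

#include <cuda_runtime.h>
#include <math.h>

// Legacy 2D texture reference; host code must bind a cudaArray to it and
// set texLUT.filterMode = cudaFilterModeLinear for hardware interpolation.
texture<float, 2, cudaReadModeElementType> texLUT;

// The "normal architecture" path: four loads plus integer/FP arithmetic.
// Assumes (u, v) lies in the interior of a row-major width-by-height grid.
__device__ float bilinearManual(const float* f, int width, float u, float v)
{
    int   iu = (int)floorf(u),  iv = (int)floorf(v);
    float a  = u - (float)iu,   b  = v - (float)iv;  // fractional parts (alpha, beta)
    return (1.0f - a) * (1.0f - b) * f[ iv      * width + iu    ]
         +         a  * (1.0f - b) * f[ iv      * width + iu + 1]
         + (1.0f - a) *         b  * f[(iv + 1) * width + iu    ]
         +         a  *         b  * f[(iv + 1) * width + iu + 1];
}

// The GPU path: one fetch, interpolation done in the texture addressing unit.
__global__ void lookupKernel(float* out, float u, float v)
{
    // +0.5f addresses texel centers, matching the integer-lattice convention.
    out[threadIdx.x] = tex2D(texLUT, u + 0.5f, v + 0.5f);
}

The point of the excerpt is exactly the difference between those two functions: the first costs four global-memory fetches and a pile of arithmetic per lookup, the second costs one texture fetch.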
And even CUFFT [2], which Intel used for comparison, is the very GPU code the authors of [21] compared theirs against. Sorry Intel, but with such methodology you don't debunk things. All you have shown is that even the codes you personally selected to debunk the "myth" with are too good for you to deal with without resorting to such ridiculous tactics.
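As for the SpMV point above: to give a feel for why the storage format matters so much, here is a hedged sketch of a one-thread-per-row SpMV kernel over an ELLPACK-R-style layout (values padded to the longest row, stored column-major for coalescing, plus a per-row length array). Illustrative only, not the code from [8] or [47]; all names are mine.

// ELLPACK-R-style SpMV: y = A * x, one thread per row.
__global__ void spmvEllR(int numRows,
                         const float* val,  // numRows x maxCols, column-major, zero-padded
                         const int*   col,  // matching column index for each stored value
                         const int*   rl,   // actual nonzeros per row (the "-R" part)
                         const float* x,
                         float*       y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= numRows) return;

    float dot = 0.0f;
    for (int n = 0; n < rl[row]; ++n) {
        // Column-major layout: adjacent threads in a warp touch adjacent
        // addresses, so these loads coalesce.
        int idx = n * numRows + row;
        dot += val[idx] * x[col[idx]];
    }
    y[row] = dot;
}

The per-row length array is what distinguishes ELLPACK-R from plain ELLPACK: each thread stops at its row's real length instead of grinding through the zero padding.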
=========
* I'm not sure the 1000x claim was really made in any of the papers they quoted. Probably not, as otherwise Intel could have used that claim to debunk the 1000x myth instead. :)