Article: Parallelism at HotPar 2010
By: AM (myname4rwt.delete@this.jee-male.com), August 18, 2010 3:33 am
Room: Moderated Discussions
Steve Underwood (steveu@coppice.org) on 8/17/10 wrote:
---------------------------
>AM (myname4rwt@jee-male.com) on 8/17/10 wrote:
>---------------------------
>>Besides, what I see in the presumably latest cudamcml manual is a 100x rule-of-thumb
>>speedup, with an option for further 10x boost (though they state clearly that the
>>switch disables certain functionality), so at first glance, the 1000x claim is still there.
>>
>>In what versions of their manuals did they claim 1000x and 50x speedups? I'm not
>>getting any hits to such figures in their manuals as you state.
>
>They hand wave about 100x and 1000x speedup, but look at the tables with specific
>results. The aggressively optimised CUDAMCML is 58x to 137x as fast as a single
>core of an i7 at 2.6GHz running the original unoptimised code, depending on the
>parameters chosen. With the -A option it is 272x to 276x as fast. They seem to have
>no results that justify the claim of a 10x speed up for the -A option. 276x is the
>best case shown. Comparing the -A option with the CPU code is a little unfair,
>as they are not performing the same job.
I've seen the figures, and you shouldn't pretend that the observation about the -A comparison being unfair is yours rather than theirs (it's on the very same page as the speedup figures). If the authors had wanted to do nothing but hand-waving, they wouldn't have bothered with such detailed disclosures.
>Using the 8 threads of the i7 effectively would probably give a 5x or 6x boost
>on this very parallelisable problem. Using SSE would probably give another 2x boost.
>That would bring the 137x down to something more like 13x.
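For reference, that estimate is nothing more than the reported figure divided by the assumed threading and SSE gains; a rough back-of-envelope sketch, where the 5.5x and 2x factors are your guesses above, not measurements:

# Back-of-envelope normalization of the reported CUDAMCML speedup.
# The scaling factors below are the assumptions from the post above,
# not measured numbers.
reported_speedup = 137.0  # GPU vs. one unoptimised i7 core, best case in the tables
thread_scaling = 5.5      # assumed gain from using all 8 i7 threads
sse_scaling = 2.0         # assumed further gain from SSE vectorisation
effective_speedup = reported_speedup / (thread_scaling * sse_scaling)
print(f"GPU vs. a fully utilised i7: ~{effective_speedup:.0f}x")  # prints ~12x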
You should check the hardware they used: a 9800 GT. $80 won't buy you a Core i7.
>How much improvement
>might come from optimising the CPU code is open to question. If someone claims they
>will show only a 2x speedup later this year, it sounds like there may be quite a
>lot of CPU performance being wasted right now.
Look, dude, the folks made the results of their work available for free, and they make it clear what hardware the speedups refer to. If you find something fishy or erroneous in their work or in how the results are presented, you're free to report it. But if you don't like how fast the CPU code runs, the only way to fix that is to provide better code than what was used for the comparison. For me, being 2x faster at a similar added price could be enough to forget about optimizing for the slower platform entirely.
>13x or more is a pretty nice speed up for a low cost solution. Even 2x is not insignificant
>if the code is very heavily used by a large audience (e.g. video codecs), which can
>justify the extra development effort. However, MCML is pretty much a best case problem.
>Most number-crunching tasks will not fit the hardware nearly as well.