Article: Parallelism at HotPar 2010
By: Jouni Osmala (josmala.delete@this.cc.hut.fi), August 4, 2010 10:10 pm
Room: Moderated Discussions
>Well, me, too.
>
>If the paper says, "Hey, or GPU implementation crushes the CPU because we found
>a way to use the dedicated GPU hardware," then I'm fine. I also know that if my
>problem *can't* use this H/W, then the claimed speedups won't apply to my problems.
>
>Without using the dedicated GPU H/W, though, I'm back to my basic claim: Showing
>a speedup greater than the extra available compute I find dubious without an explanation.
Based on the Radeon ISA documentation, there are kinds of vector parallelism that GPU instructions enable but SSE instructions do not.
There are loads from a vector of addresses into a vector of lanes (gathers), and the equivalent stores (scatters) — sketched below.
Then there are conditional-execution bits on every instruction, per vector lane. Then there are 128 registers. Latency hiding by running lots of threads on the same core could also help.
There are also dedicated loop instructions and non-destructive register use, unlike x86, which should save some extra instructions.
At the hardware level the Radeon has relatively general-purpose improvements over x86/SSE.
Then there are dot-product and other similar instructions that may come in handy.
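To make the gather/scatter and predication points concrete, here is a minimal GPU kernel sketch. I've written it in CUDA for readability even though the post is about the Radeon ISA; the kernel name, array names, and sizes are invented for illustration, not taken from any paper.

__global__ void gather_scale_scatter(const float *src, float *dst,
                                     const int *src_idx, const int *dst_idx,
                                     int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = src[src_idx[i]];       // gather: every lane reads its own address
    if (v > 0.0f)                    // per-lane predication, no explicit mask juggling
        v *= scale;
    dst[dst_idx[i]] = v;             // scatter: every lane writes its own address
}

SSE has no single-instruction equivalent for the gather/scatter part; it has to be synthesized from scalar loads, shuffles, and masked blends.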
Now here's the catch: if you define the problem the right way, you can render SSE useless and force the x86 core to spend a lot of time loading data that would fit in the GPU's registers, to burn extra move and conditional-move instructions, and in practice to sit waiting out long instruction latencies (or at best hit L1 even more often), while the GPU simply has another work unit running in other threads.
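A sketch of the kind of problem I mean, again in CUDA with invented names: each element follows a short, data-dependent chain of indirect loads. SSE lanes can't share one chain, so the CPU code degenerates to scalar loads whose latency the core must simply wait out, while the GPU scheduler switches to another wavefront whenever one stalls on a load.

__global__ void chase_chain(const int *next, const float *value,
                            const int *start, float *out, int n, int depth)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int node = start[i];
    float acc = 0.0f;
    for (int d = 0; d < depth; ++d) {
        acc += value[node];      // each load depends on the previous one
        node = next[node];       // data-dependent address: no useful prefetch,
    }                            // no straightforward SSE vectorization
    out[i] = acc;
}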
Of course there are plenty of counterexamples showing the weaknesses of GPUs, but let's be clear: if the goal is to show a case where GPUs are genuinely superior to CPUs no matter how much software optimization goes into the other side, it can be done, and in a way that gives a far bigger difference than the theoretical calculations earlier in this thread.
Of course no one writes assembler on the GPU side, and some of that is lost in translation, but there is still potential there that doesn't exist in x86.
PS. I've never written GPU code, so this is pure speculation based on the ISA documentation; real-world compilers may give different results. But the claim that SSE and other optimizations could overcome the differences shown in the papers is just as speculative.
Jouni
-Now back to work.