Article: Introduction to OpenCL
By: Wainwright (firstname.lastname@example.org), December 11, 2010 3:44 pm
Room: Moderated Discussions
>For example, I've seen 3x performance slowdown using vectorized OpenCL code which
>assumes a 4-wide work-item, while running on Nvidia chips.
This does not make much sense to me, unless:
In the initial 1-wide implementation, you use, say, 512 threads per block.
In the 4-wide implementation, you still use 512 threads.
Doing this without changing anything else might very well cause registers to be spilled into GPU RAM, since each thread now holds more live values and not all registers fit on-chip, so you get lower performance.
If you don't make each thread the same weight, or at least each thread block the same weight, you might very well get lower performance. But that isn't due to the use of 4-wide work-items as such.
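To put rough numbers on this, here is a small back-of-the-envelope sketch in Python. Everything in it is an assumption made up for illustration: the 32K-entry register file, the 20 registers per scalar thread, and the idea that register use grows linearly with vector width. Real compilers will behave differently, so treat it as a way to reason about block weight, not as a model of any particular chip.

```python
# All hardware and kernel numbers below are illustrative assumptions,
# not measured values for any specific GPU or kernel.
REGISTER_FILE_PER_SM = 32 * 1024   # assumed number of 32-bit registers one block can use
SCALAR_REGS_PER_THREAD = 20        # hypothetical register use of the 1-wide kernel


def regs_per_thread(width):
    """Crude model: register use grows linearly with the vector width."""
    return SCALAR_REGS_PER_THREAD * width


def block_fits(threads_per_block, width):
    """True if the registers of the whole block fit on-chip (no spilling)."""
    return threads_per_block * regs_per_thread(width) <= REGISTER_FILE_PER_SM


def equal_weight_block_size(scalar_threads, width):
    """Threads per block such that the block still covers the same number of elements."""
    if scalar_threads % width != 0:
        raise ValueError("block size must be divisible by the vector width")
    return scalar_threads // width


if __name__ == "__main__":
    print("1-wide, 512 threads fits:", block_fits(512, 1))
    print("4-wide, 512 threads fits:", block_fits(512, 4))
    rescaled = equal_weight_block_size(512, 4)
    print("4-wide, rescaled threads:", rescaled, "fits:", block_fits(rescaled, 4))
```

Under these assumed numbers the 4-wide kernel with an unchanged 512-thread block exceeds the register budget, while shrinking the block to 128 threads keeps the per-block weight identical to the 1-wide case.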
By the way, 4-wide instructions are good for AMD GPUs because they help the compiler find ILP so that it can fill the 5-wide VLIW units, correct? If that is the case, shouldn't 4-wide elements also help the GF104 GPUs find some ILP, so that they can more easily utilize the 3-warps 2-schedulers architecture?