Article: Parallelism at HotPar 2010
By: Ants Aasma (ants.aasma.delete@this.eesti.ee), August 4, 2010 1:10 pm
Room: Moderated Discussions
Mark Roulo (nothanks@xxx.com) on 8/4/10 wrote:
---------------------------
>Effectively, each SM in an nVidia GPU does a thread-switch after each instruction.
>Which is fine, since they are optimized for throughput, but also necessary to hide
>dependencies between instruction.
---------------------------
I'm not sure that's completely true. The instruction latency is partly, if not fully, masked by having the logical vector larger than the physical one (à la SSE before Core 2). It probably also helps that the scheduling and other control logic run at a lower clock rate. Too lazy to go look it up, but I think NVidia had a physical vector length of 8 and a logical length of 32, so each instruction executes for 4 cycles.
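For the arithmetic, here's a minimal host-side sketch. The warp size comes from the driver, but the physical SIMD width is an assumption for a G80/GT200-class SM with 8 SPs, hard-coded because the CUDA API doesn't expose it:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Logical vector length: the warp size the driver reports (32 on NVidia parts).
    int logical = prop.warpSize;

    // Physical vector length: assumed 8 for a G80/GT200-class SM;
    // not queryable through the CUDA API, so hard-coded for illustration.
    int physical = 8;

    // Each warp instruction occupies the SIMD pipeline for logical/physical cycles.
    printf("Cycles per warp instruction: %d\n", logical / physical);
    return 0;
}

With those numbers it prints 4, matching the 4-cycle figure above.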