Article: Parallelism at HotPar 2010
By: Mark Roulo (nothanks.delete@this.xxx.com), August 4, 2010 2:10 pm
Room: Moderated Discussions
Ants Aasma (ants.aasma@eesti.ee) on 8/4/10 wrote:
---------------------------
>Mark Roulo (nothanks@xxx.com) on 8/4/10 wrote:
>---------------------------
>>Effectively, each SM in an nVidia GPU does a thread-switch after each instruction.
>>Which is fine, since they are optimized for throughput, but also necessary to hide
>>dependencies between instructions.
>
>I'm not sure that that's completely true. The instruction latency is partly if
>not fully masked by having the logical vector larger than the physical (a la SSE
>prior to Core2). It probably also helps to have the scheduling and other control
>logic run at a lower rate. Too lazy to go look it up, but I think NVidia had a physical
>vector length of 8 and a logical length of 32, so each instruction executes for 4 cycles.
The G200 family worked the way you describe. The *throughput* was as you describe, but the *latency* was not: you needed about 6 warps per SM to hide it.
Fermi changed things so that the SMs are 32 lanes wide physically. You still need about 6 warps per SM to hide the instruction latency.
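To make the arithmetic concrete, here is a minimal sketch in C (not from either post); the ~24-cycle dependent-instruction latency and the 4-cycles-per-warp issue rate are my assumptions for a G200-style SM, used only to show where a figure of roughly 6 warps can come from:

/* Hedged sketch: estimate how many warps an SM needs in flight to cover
   back-to-back dependent-instruction latency. The latency and issue-rate
   numbers below are illustrative assumptions, not vendor figures. */
#include <stdio.h>

int main(void) {
    int alu_latency_cycles    = 24; /* assumed register-to-register ALU latency */
    int issue_cycles_per_warp = 4;  /* 32-wide warp issued over 8 physical lanes */
    int warps_needed = (alu_latency_cycles + issue_cycles_per_warp - 1)
                       / issue_cycles_per_warp; /* round up */
    printf("~%d warps per SM to cover the latency\n", warps_needed);
    return 0;
}

With those assumed numbers the result is 6, which is consistent with the "about 6 warps per SM" figure above.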
-Mark Roulo