Article: Parallelism at HotPar 2010
By: Ants Aasma (ants.aasma.delete@this.eesti.ee), August 4, 2010 3:01 pm
Room: Moderated Discussions
Mark Roulo (nothanks@xxx.com) on 8/4/10 wrote:
---------------------------
>The G200 family had what you described. The *throughput* was also as you described,
>but the *latency* was not. You needed about 6 warps per SM to hide the latency.
>
>Fermi changed things so that you have 32 wide physical SMs. You still need about
>6 warps per SM to hide the instruction latency.
Interesting. So there is a 24-cycle latency between dependent instructions on the G200. I guess they don't have a forwarding network, so the data has to go through the register file. Makes perfect sense when I think about it. It's not as if they didn't already have enough data routing issues to deal with.
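
For anyone wondering where the 24 comes from, here is a minimal sketch of the arithmetic. It assumes a simplified model that is not spelled out in this post: a 32-thread warp issued over 8 lanes per SM on the G200, so each warp occupies the pipeline for 4 cycles, and the scheduler rotates through the other warps before returning to the dependent instruction. The function name and parameters are hypothetical, made up for illustration only.

```python
def dependent_latency_cycles(warps_to_hide: int, warp_size: int = 32, lanes: int = 8) -> int:
    """Estimate the dependent-instruction latency that a given number of warps hides.

    Simplified model (an assumption, not vendor documentation): each warp takes
    warp_size / lanes cycles to issue, and warps are issued back to back.
    """
    if lanes <= 0 or warp_size % lanes != 0:
        raise ValueError("warp_size must be a positive multiple of lanes")
    cycles_per_warp = warp_size // lanes  # 32 threads over 8 lanes -> 4 cycles
    return warps_to_hide * cycles_per_warp


# G200 under the assumed 8-lane model: 6 warps at 4 cycles each
print(dependent_latency_cycles(6))  # prints 24
```

This is only a back-of-the-envelope model. It should not be stretched to Fermi by plugging in 32 lanes, since the quoted post says only that about 6 warps are still needed there, not what the per-warp issue time is.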