Article: Parallelism at HotPar 2010
By: Ants Aasma (ants.aasma.delete@this.eesti.ee), August 4, 2010 1:00 pm
Room: Moderated Discussions
Michael S (already5chosen@yahoo.com) on 8/4/10 wrote:
---------------------------
>Huge bandwidth advantage, yes, but, like with other throughput comparisons, over
>Westmere (~1.2 TB/s of SIMD register read bandwidth+ some more of GPRs) it's more
>like 10x rather than 100x. Magny-Cours is closer yet.
That isn't an adequate comparison. The amount of data reuse you can get out of 16 registers is quite limited, whereas 256KB can give significant data reuse for algorithms amenable to cache blocking. So in that sense it's a bit more appropriate to compare the GPU register file bandwidth to the CPU's L1 and L2 bandwidth, but that's not apples to apples either. That should be multiplied by whatever the reuse factor inside the register file is.
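
To put rough numbers on the reuse argument, here is a back-of-the-envelope sketch for a blocked matrix multiply. Everything in it is an assumption for illustration: single-precision elements, square b-by-b tiles, three tiles (A, B and C) resident at once, and 16 registers of 16 bytes each on the CPU side. The helper name tile_reuse is made up. In a tile multiply each loaded element is used about b times, so b stands in for the reuse factor.

```python
import math

def tile_reuse(capacity_bytes, elem_bytes=4, resident_tiles=3):
    """Approximate reuse factor for a blocked matrix multiply.

    Returns the largest tile edge b such that `resident_tiles` square
    b-by-b tiles of `elem_bytes`-sized elements fit in `capacity_bytes`.
    Each element loaded into a tile is then used roughly b times.
    """
    elems_per_tile = capacity_bytes // (elem_bytes * resident_tiles)
    return math.isqrt(elems_per_tile)

# Assumed: 16 SIMD registers of 16 bytes each (256 bytes total).
cpu_registers = tile_reuse(16 * 16)

# The 256KB of storage mentioned above.
large_store = tile_reuse(256 * 1024)

print(cpu_registers, large_store)
```

Under these assumptions the 16 registers allow a reuse factor of about 4, while 256KB allows about 147. Real kernels will differ, but the gap is the point.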