Article: Parallelism at HotPar 2010
By: Michael S (already5chosen.delete@this.yahoo.com), August 4, 2010 11:33 am
Room: Moderated Discussions
Ants Aasma (ants.aasma@eesti.ee) on 8/4/10 wrote:
---------------------------
>Richard Cownie (tich@pobox.com) on 8/3/10 wrote:
>---------------------------
>>>Cache sizes are radically different though. Westmere has 32KB/core L1, 256KB/core L2 and 2MB/core L3.
>>
>>Not so radically different, a Radeon 5870 has 20 L1 caches
>>each with 8KB for a total of 160KB. That's bigger than
>>4 x 32KB L1's = 128KB, but smaller than 4 x 256KB L2's.
>>Of course the larger number of caches is less effective
>>if much of the data has to be replicated.
>
>The GPU is mostly optimized for multithreading. Most of the
>memory is in the humongous register set(s). The 5870 has 5MiB
>of registers.
That's approximately a 1000x advantage over Westmere or Magny-Cours.
>Each core can fetch 12 64byte operands per
>clock (16way SIMD) and there are 20 cores. So the bandwidth
>from the register set is about 13TB/s, factor in the writes
>and you should get something on the order of 18TB/s. If the
>workload can be partitioned into the register set, then there
>is a huge bandwidth advantage.
A huge bandwidth advantage, yes, but as with other throughput comparisons, the advantage over Westmere (~1.2 TB/s of SIMD register read bandwidth, plus some more from the GPRs) is more like 10x than 100x. Magny-Cours is closer still.
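
For anyone who wants to check the arithmetic, here is a rough sketch in Python. The 12 operands per clock, 64 bytes per operand and 20 cores come from the quoted post, and the ~1.2 TB/s Westmere figure is taken as given from above. The 850 MHz engine clock is my assumption (the 5870's stock clock; the quoted post does not state it), and the function name is made up for illustration.

```python
def register_read_bandwidth(operands_per_clock, bytes_per_operand, cores, clock_hz):
    """Aggregate register read bandwidth in bytes per second."""
    return operands_per_clock * bytes_per_operand * cores * clock_hz

# Assumption: stock Radeon 5870 engine clock, not stated in the thread.
GPU_CLOCK_HZ = 850e6

# Figures from the quoted post: 12 operands/clock, 64 bytes each, 20 cores.
gpu_read = register_read_bandwidth(12, 64, 20, GPU_CLOCK_HZ)

# Westmere SIMD register read bandwidth, used as given above (~1.2 TB/s).
WESTMERE_READ = 1.2e12

ratio = gpu_read / WESTMERE_READ

print(f"GPU register reads: {gpu_read / 1e12:.1f} TB/s")
print(f"Advantage over Westmere: {ratio:.1f}x")
```

Under those assumptions the read bandwidth lands at about 13 TB/s, matching the quoted figure, and the ratio to Westmere comes out near 11x, so the same order of magnitude as the other throughput comparisons.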