Article: Parallelism at HotPar 2010
By: Ants Aasma (ants.aasma.delete@this.eesti.ee), August 4, 2010 10:00 am
Room: Moderated Discussions
Richard Cownie (tich@pobox.com) on 8/3/10 wrote:
---------------------------
>>Cache sizes are radically different though. Westmere has
>>32KB/core L1, 256KB/core L2 and 2MB/core L3.
>
>Not so radically different, a Radeon 5870 has 20 L1 caches
>each with 8KB for a total of 160KB. That's bigger than
>4 x 32KB L1's = 128KB, but smaller than 4 x 256KB L2's.
>Of course the larger number of caches is less effective
>if much of the data has to be replicated.
The GPU is mostly optimized for multithreading. Most of the
memory is in the humongous register set(s). The 5870 has 5MiB
of registers. Each core can fetch 12 64-byte operands per
clock (16-way SIMD) and there are 20 cores. So the read
bandwidth from the register set is about 13TB/s. Factor in
the writes and you should get something on the order of
18TB/s. If the working set can be partitioned to fit into the
register set, then there is a huge bandwidth advantage.
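
For anyone who wants to check the arithmetic, here is a rough sketch of
the read-bandwidth estimate. The 850 MHz core clock is my assumption
(it is not stated above), and the variable names are made up for
illustration. The write side is left out because the per-clock write
count isn't given.

```python
# Back-of-the-envelope register read bandwidth for the Radeon 5870.
# Figures from the post: 12 operands/clock, 64 bytes each, 20 cores.
# ASSUMPTION: 850 MHz core clock (not stated in the post).

OPERANDS_PER_CLOCK = 12      # operand fetches per core per clock
OPERAND_BYTES = 64           # 16-way SIMD x 4-byte values
CORES = 20                   # SIMD cores on the chip
CLOCK_HZ = 850e6             # assumed core clock

bytes_per_clock = OPERANDS_PER_CLOCK * OPERAND_BYTES * CORES
read_bw = bytes_per_clock * CLOCK_HZ   # bytes per second

print(f"{bytes_per_clock} bytes/clock across the chip")
print(f"{read_bw / 1e12:.2f} TB/s register read bandwidth")
```

Under that clock assumption the result lands at roughly 13TB/s, which
matches the figure quoted above.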