Article: Parallelism at HotPar 2010
By: Mark Roulo (nothanks.delete@this.xxx.com), August 4, 2010 11:29 am
Room: Moderated Discussions
Michael S (already5chosen@yahoo.com) on 8/4/10 wrote:
---------------------------
>Ants Aasma (ants.aasma@eesti.ee) on 8/4/10 wrote:
>---------------------------
>>Richard Cownie (tich@pobox.com) on 8/3/10 wrote:
>>---------------------------
>>>>Cache sizes are radically different though. Westmere has 32KB/core L1, 256KB/core L2 and 2MB/core L3.
>>>
>>>Not so radically different, a Radeon 5870 has 20 L1 caches
>>>each with 8KB for a total of 160KB. That's bigger than
>>>4 x 32KB L1's = 128KB, but smaller than 4 x 256KB L2's.
>>>Of course the larger number of caches is less effective
>>>if much of the data has to be replicated.
>>
>>The GPU is mostly optimized for multithreading. Most of the
>>memory is in the humongous register set(s). The 5870 has 5MiB
>>of registers.
>
>That's approximately a 1000x advantage over Westmere or Magny-Cours.
Yes, but different tradeoffs are being made here. 1000x the register memory doesn't map to a 1000x performance gain :-) Also note that the 1000x figure compares only the architecturally visible x86 registers against the GPU registers. Counting the physical registers used for register renaming, the difference is probably closer to 200x :-)
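As a rough sanity check on the 1000x figure, here is a back-of-the-envelope sketch. Only the 5MiB GPU number comes from the thread; the CPU side (6 cores, 2 hardware threads per core, 16 64-bit GPRs plus 16 128-bit SSE registers per thread, everything else ignored) is an assumption made for illustration.

```python
# Back-of-the-envelope check of the register-capacity ratio.
# Only the 5 MiB GPU figure is from the thread; every CPU-side
# number below is an assumption made for illustration.

GPU_REGISTER_BYTES = 5 * 1024 * 1024  # 5 MiB, as quoted above

# Assumed x86-64 architectural state per hardware thread
# (x87/MMX, flags and segment registers are ignored):
GPR_BYTES = 16 * 8    # 16 general-purpose registers, 64 bits each
XMM_BYTES = 16 * 16   # 16 SSE registers, 128 bits each
BYTES_PER_THREAD = GPR_BYTES + XMM_BYTES

# Assumed chip configuration: 6 cores, 2 hardware threads per core
CORES = 6
THREADS_PER_CORE = 2


def architectural_register_bytes(cores, threads_per_core):
    """Total architecturally visible register bytes on the chip."""
    return cores * threads_per_core * BYTES_PER_THREAD


cpu_bytes = architectural_register_bytes(CORES, THREADS_PER_CORE)
ratio = GPU_REGISTER_BYTES / cpu_bytes

# Total register storage the CPU would need for the ratio to be 200x
implied_bytes_at_200x = GPU_REGISTER_BYTES / 200

print(f"CPU architectural registers: {cpu_bytes} bytes")
print(f"GPU/CPU ratio: {ratio:.0f}x")
print(f"CPU register bytes implied by a 200x ratio: {implied_bytes_at_200x:.0f}")
```

Under those assumptions the ratio lands a little above 1000x, and a 200x ratio would correspond to roughly 26KB of register storage across the whole CPU. Different assumptions about core count or which registers to count will move both numbers.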
But the primary point of all these registers is to make thread switching free, which the GPUs *need* in order to hide memory latency and the latency between dependent instructions. Effectively, each SM in an nVidia GPU does a thread switch after each instruction. That is fine, since GPUs are optimized for throughput, but it is also necessary to hide dependencies between instructions. In any event, this makes the comparison difficult to interpret.
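To put a rough number on the latency-hiding argument, here is a minimal sketch of how many resident threads (or warps) are needed to keep the issue slot busy. The model is deliberately simple: every instruction depends on the previous one from the same thread, and the scheduler rotates through ready threads. The function name and the cycle counts are hypothetical, chosen only for illustration.

```python
import math


def threads_to_hide_latency(latency_cycles, issue_interval_cycles=1):
    """Threads needed so that some thread is always ready to issue.

    Simplified model: after a thread issues an instruction it cannot
    issue its next (dependent) instruction for `latency_cycles`, and
    each issue occupies the scheduler for `issue_interval_cycles`.
    """
    return math.ceil(latency_cycles / issue_interval_cycles)


# Illustrative, assumed numbers (not measurements from any specific GPU):
arithmetic = threads_to_hide_latency(latency_cycles=24, issue_interval_cycles=4)
memory = threads_to_hide_latency(latency_cycles=400, issue_interval_cycles=4)

print(f"threads to cover a 24-cycle dependency: {arithmetic}")
print(f"threads to cover a 400-cycle memory access: {memory}")
```

Each of those resident threads needs its own registers live at all times for the switch to cost nothing, which is where the large register files go.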
-Mark Roulo