Article: Parallelism at HotPar 2010
By: Mark Roulo (nothanks.delete@this.xxx.com), August 3, 2010 10:23 am
Room: Moderated Discussions
Richard Cownie (tich@pobox.com) on 8/3/10 wrote:
---------------------------
>However, your data structures aren't necessarily going
>to fit in the cpu caches - especially if you haven't taken
>a lot of care optimizing them. And if they don't fit in
>the rather small L1 caches, then you're not going to get
>a 4x or 6x speedup from using multiple cores, because
>everything will go through the shared L2 cache. [Hmm,
>I guess Nehalem-EX has some fairly fancy multiple cache
>+ ringbus to make this better - but that's probably not
>what the benchmarks in the literature are comparing
>against].
The Nehalem L2 caches (256 KB) are per-core and are not shared. The pre-Nehalem Intel chips had larger, shared L2s. On Nehalem, the trick is to not need the L3 (which is shared, and which the cores *will* fight over).
But if the CPU code is spilling out of the 256 KB L2, then the GPU code will likely be spilling out of the 48 KB shared memory ...
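To put rough numbers on that comparison, here is a minimal sketch that checks a working set against the two capacities mentioned above. The helper names and the example element counts are hypothetical, chosen only for illustration; the only figures taken from the discussion are the 256 KB per-core L2 and the 48 KB shared memory.

```python
# Capacities quoted in the post above; everything else is illustrative.
NEHALEM_L2_BYTES = 256 * 1024       # per-core L2 on Nehalem
GPU_SHARED_MEM_BYTES = 48 * 1024    # GPU shared memory


def working_set_bytes(num_elements, bytes_per_element):
    """Size of a flat array of fixed-size elements (hypothetical helper)."""
    return num_elements * bytes_per_element


def fits(working_set, capacity):
    """True if the working set does not exceed the given capacity."""
    return working_set <= capacity


# Assumed example: 40,000 doubles (8 bytes each) = 320,000 bytes.
big = working_set_bytes(40_000, 8)
# Assumed example: 4,000 doubles = 32,000 bytes.
small = working_set_bytes(4_000, 8)

for name, ws in (("big", big), ("small", small)):
    print(name, ws,
          "fits L2:", fits(ws, NEHALEM_L2_BYTES),
          "fits shared mem:", fits(ws, GPU_SHARED_MEM_BYTES))
```

This only compares sizes; it says nothing about access patterns, associativity, or how much of each memory other data is already using.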
-Mark Roulo