Article: Parallelism at HotPar 2010
By: Richard Cownie (tich.delete@this.pobox.com), August 3, 2010 12:52 pm
Room: Moderated Discussions
David Kanter (dkanter@realworldtech.com) on 8/3/10 wrote:
---------------------------
>According to the paper from Hackenberg, using a copy test on 8 cores (two Nehalem
>sockets), they can get 623GB/s, 278GB/s and 76GB/s for the caches. Cutting that in
>half yields 311, 139 and 38GB/s respectively for a 4 core Nehalem. Westmere should
>hit 466GB/s and 208.5GB/s for L1 and L2. I don't know if the L3 cache bandwidth increased by 50% though...
Thanks, those figures sound more plausible. But maybe
there's still an extra factor of 2x if we only want read
bandwidth, not read bandwidth + write bandwidth?
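A quick sketch of the arithmetic behind the quoted figures. It assumes aggregate cache bandwidth scales linearly with core count and that "Westmere" here means a 6-core part (which is what the 50% step implies). The names are just for illustration, not from the paper.

```python
# Copy bandwidth in GB/s measured on 8 Nehalem cores (two sockets),
# as quoted from the Hackenberg paper above.
MEASURED_8_CORES = {"L1": 623.0, "L2": 278.0, "L3": 76.0}


def scale_bandwidth(bw_gb_s, from_cores, to_cores):
    """Assumption: aggregate cache bandwidth scales linearly with core count."""
    return bw_gb_s * to_cores / from_cores


# 4-core Nehalem: half of the 8-core figures.
nehalem_4 = {lvl: scale_bandwidth(bw, 8, 4) for lvl, bw in MEASURED_8_CORES.items()}

# 6-core Westmere (assumed): 1.5x the 4-core figures for the per-core L1 and L2.
# L3 is left out because it is shared, and the quoted post is unsure
# whether its bandwidth grew by 50%.
westmere_6 = {lvl: scale_bandwidth(nehalem_4[lvl], 4, 6) for lvl in ("L1", "L2")}

for name, table in (("Nehalem 4c", nehalem_4), ("Westmere 6c", westmere_6)):
    print(name, {lvl: f"{bw:.1f} GB/s" for lvl, bw in table.items()})
```

The Westmere L1 value comes out slightly above the quoted 466GB/s, presumably because the quote rounds the 4-core figure down to 311 before multiplying by 1.5.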
>Cache sizes are radically different though. Westmere has
>32KB/core L1, 256KB/core L2 and 2MB/core L3.
Not so radically different: a Radeon 5870 has 20 L1 caches
each with 8KB for a total of 160KB. That's bigger than
4 x 32KB L1s = 128KB, but smaller than 4 x 256KB L2s = 1MB.
Of course the larger number of caches is less effective
if much of the data has to be replicated.
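To put rough numbers on that, here is a small sketch. The totals come straight from the sizes above; the `unique_kb` replication model is a toy assumption of mine (a fixed fraction of every cache holds the same replicated data), not a measurement of either chip.

```python
def total_kb(n_caches, kb_each):
    """Aggregate capacity of n identical caches, in KB."""
    return n_caches * kb_each


def unique_kb(n_caches, kb_each, replicated_fraction):
    """Toy model (assumption): a fraction of every cache holds the same
    replicated data, which counts only once toward unique capacity."""
    private = n_caches * kb_each * (1.0 - replicated_fraction)
    shared = kb_each * replicated_fraction
    return private + shared


radeon_l1 = total_kb(20, 8)   # 20 L1 caches of 8KB on the Radeon 5870
cpu_l1 = total_kb(4, 32)      # 4 cores x 32KB L1
cpu_l2 = total_kb(4, 256)     # 4 cores x 256KB L2

print(f"Radeon L1 total: {radeon_l1}KB, CPU L1 total: {cpu_l1}KB, CPU L2 total: {cpu_l2}KB")

# How much distinct data fits as the replicated share grows.
for frac in (0.0, 0.5, 1.0):
    print(f"replicated={frac:.0%}: Radeon L1 {unique_kb(20, 8, frac):.0f}KB, "
          f"CPU L1 {unique_kb(4, 32, frac):.0f}KB")
```

Under this model the many small caches lose their capacity advantage quickly once a large share of the working set is replicated in every cache.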
>So realistically, you're talking about a difference of ~2X for cache bandwidth
>(just comparing L1 to L1 and L2 to L2).
> I could believe that in practice, the advantage
>is more like 3X...given the different optimizations in the cache hierarchy, but
>I don't see well written code getting much higher than that.
Sounds plausible. So on this analysis, the GPU cache
hierarchy offered some quite big advantages over the
Core2 hierarchy (for certain apps); but Nehalem's
multiple L2 caches helped it catch up; and then Westmere's
higher clock speeds and other tweaks will have closed the
gap further.
With Sandy Bridge offering AVX and a bunch of architectural
tweaks, we can probably expect the CPU:GPU ratio to move
further towards CPUs over the next year. Though there
may still be some niches where the GPU looks attractive
for its performance-per-dollar or performance-per-watt,
since Intel charges a fortune for the big CPUs with lots
of cores and lots of cache ...
At the least, anyone expecting comparisons from 2007-2008
to give a guide to results in 2011 is probably going to
have some surprises.