Article: Parallelism at HotPar 2010
By: David Kanter (dkanter.delete@this.realworldtech.com), August 3, 2010 1:21 pm
Room: Moderated Discussions
Richard Cownie (tich@pobox.com) on 8/3/10 wrote:
---------------------------
>David Kanter (dkanter@realworldtech.com) on 8/3/10 wrote:
>---------------------------
>
>>According to the paper from Hackenberg, using a copy test on 8 cores (two Nehalem
>>sockets), they can hit 623GB/s, 278GB/s and 76GB/s for the caches. Cutting that in
>>half, yields 311, 139 and 38GB/s respectively for a 4 core Nehalem. Westmere should
>>hit 466GB/s, 208.5GB/s. I don't know if the L3 cache bandwidth increased by 50% though...
>
>Thanks, those figures sound more plausible. But maybe
>there's still an extra factor of 2x if we only want read
>bandwidth, not read bandwidth + write bandwidth ?
That's true for Nehalem, although not for any AMD designs, nor for Sandy Bridge.
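To make the read-vs-copy distinction concrete, here's a minimal C sketch of the two kinds of kernels: a copy loop that issues a load and a store per element, and a read-only reduction that issues loads alone. This is not the Hackenberg code; the buffer size and repeat count are illustrative, and whether a given copy figure counts one or both directions of traffic depends on the benchmark's accounting.

/* Minimal sketch, not the Hackenberg benchmark itself: a copy kernel moves
   a byte in and a byte out per element, while a read-only reduction
   generates loads alone.  Size the buffers to the cache level you want
   to measure. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N    8192      /* 8192 doubles = 64KB per buffer, fits a 256KB L2 */
#define REPS 100000

static double seconds(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + 1e-9 * (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    for (size_t i = 0; i < N; i++) a[i] = (double)i;

    struct timespec t0, t1;

    /* copy: N loads + N stores per pass */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; r++)
        memcpy(b, a, N * sizeof *a);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double copy_s = seconds(t0, t1);

    /* read-only: N loads per pass, no store traffic
       (a real benchmark would unroll this to hide FP add latency) */
    volatile double sink = 0.0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; r++) {
        double s = 0.0;
        for (size_t i = 0; i < N; i++) s += a[i];
        sink += s;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double read_s = seconds(t0, t1);

    double bytes = (double)N * sizeof(double) * REPS;
    printf("copy: %.1f GB/s (read + write counted)\n", 2.0 * bytes / copy_s / 1e9);
    printf("read: %.1f GB/s (reads only)\n", bytes / read_s / 1e9);
    free(a); free(b);
    return 0;
}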
>>Cache sizes are radically different though. Westmere has 32KB/core L1, 256KB/core L2 and 2MB/core L3.
>
>Not so radically different, a Radeon 5870 has 20 L1 caches
>each with 8KB for a total of 160KB. That's bigger than
>4 x 32KB L1's = 128KB, but smaller than 4 x 256KB L2's.
>Of course the larger number of caches is less effective
>if much of the data has to be replicated.
Westmere is 6 cores, so that's a cumulative 192KB of L1D and 1.5MB of L2.
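For reference, the back-of-the-envelope aggregates, using only the per-core figures quoted in this thread (nothing new, just the arithmetic):

/* Aggregate on-chip cache capacities from the per-core figures above. */
#include <stdio.h>

int main(void)
{
    /* 6-core Westmere: 32KB L1D and 256KB L2 per core */
    printf("Westmere L1D total: %d KB\n", 6 * 32);    /* 192 KB          */
    printf("Westmere L2  total: %d KB\n", 6 * 256);   /* 1536 KB = 1.5MB */

    /* Cypress (Radeon 5870): 20 L1 caches of 8KB each */
    printf("Cypress  L1  total: %d KB\n", 20 * 8);    /* 160 KB          */
    return 0;
}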
>>So realistically, you're talking about a difference of ~2X for cache bandwidth
>>(just comparing L1 to L1 and L2 to L2).
>> I could believe that in practice, the advantage
>>is more like 3X...given the different optimizations in the cache hierarchy, but
>>I don't see well written code getting much higher than that.
>
>Sounds plausible. So on this analysis, the GPU cache
>hierarchy offered some quite big advantages over the
>Core2 hierarchy (for certain apps); but Nehalem's
>multiple L2 caches helped it catch up; and then Westmere's
>higher clock speeds and other tweaks will have closed the
>gap again.
I'm not quite sure I'd phrase it that way...in the sense that we're talking about a *very* modern GPU. Cypress is newer than Nehalem and more of a contemporary of Westmere.
IIRC, RV770 was the contemporary of Nehalem, not Core 2.
>With SandyBridge offering AVX and a bunch of architectural
>tweaks, we can probably expect the CPU:GPU ratio to move
>further towards CPUs over the next year. Though there
>may still be some niches where the GPU looks attractive
>for its performance-per-dollar or performance-per-watt -
>Intel charges a fortune for the big cpu's with lots
>of cores and lots of cache ...
For read bandwidth, Sandy Bridge is key because it has two independent AGUs...rather than 1 load AGU + 1 store AGU. So now you can really fire off two loads per cycle, although the L1D is still limited to 48B/cycle.
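As a sketch of what that buys you, here is the shape of a read-bandwidth inner loop that can keep two load ports busy: two independent 16B SSE loads per iteration, accumulated into separate registers so the adds don't serialize. Illustrative only; a serious benchmark would unroll further, pin the thread, and size the buffer for the cache level of interest.

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>
#include <stdlib.h>

#define N 4096           /* 4096 doubles = 32KB, sized for a 32KB L1D */

static double read_sum(const double *a)
{
    __m128d s0 = _mm_setzero_pd();
    __m128d s1 = _mm_setzero_pd();
    for (size_t i = 0; i < N; i += 4) {
        /* two independent 16B loads per iteration: a core with two
           load-capable AGUs can issue both in the same cycle */
        s0 = _mm_add_pd(s0, _mm_load_pd(a + i));
        s1 = _mm_add_pd(s1, _mm_load_pd(a + i + 2));
    }
    double out[2];
    _mm_storeu_pd(out, _mm_add_pd(s0, s1));
    return out[0] + out[1];
}

int main(void)
{
    double *a = aligned_alloc(16, N * sizeof *a);   /* 16B-aligned for _mm_load_pd */
    for (size_t i = 0; i < N; i++) a[i] = 1.0;
    printf("%.1f\n", read_sum(a));                  /* expect 4096.0 */
    free(a);
    return 0;
}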
In some ways, it may be even more interesting to compare Bulldozer to a GPU...since I get the sense that AMD made a deliberate decision to optimize for throughput at the cost of per-core performance (whereas Intel has not made that choice for mainstream parts).
>At the least, anyone expecting comparisons from 2007-2008
>to give a guide to results in 2011 is probably going to
>have some surprises.
Yeah, I'd say that's the biggest takeaway. A multi-core CPU can take advantage of die area just as much as a multi-core device targeted at graphics (a GPU). There's a difference of about 2-4X due to architecture and optimization points for the vast majority of GPGPU workloads. For workloads that can use HW that is unique to the GPU, the difference is bigger...but that's an exceptionally rare case.
And of course for non-data-parallel workloads, the CPU will probably blow away the GPU by a huge margin.
The one tricky part is that Intel's manufacturing is 12-18 months ahead of TSMC and GF...so in some cases an Intel multi-core will have a fair advantage due to power/area improvements.
David