Article: Parallelism at HotPar 2010
By: Richard Cownie (tich.delete@this.pobox.com), August 3, 2010 3:55 pm
Room: Moderated Discussions
David Kanter (dkanter@realworldtech.com) on 8/3/10 wrote:
---------------------------
>I'm not quite sure I'd phrase it that way...in the sense that we're talking about
>a *very* modern GPU. Cypress is newer than Nehalem and more contemporary to Westmere.
>
>IIRC RV770 was more contemporary to Nehalem and not the Core2.
Fair enough. Though probably what's most relevant to Mark
Roulo's skepticism about the claimed speedups would be
the comparison of Core2 against Nvidia's chips from 2008.
I don't know those details.
>For read bandwidth, Sandy Bridge is key because they have two independent AGUs...rather
>than 1 LD AGU + 1 ST AGU. So now you can really fire off two loads per cycle.
>Although the cache is limited to 48B/cycle.
>
>In some ways, it may be even more interesting to compare Bulldozer to a GPU...since
>I get the sense that they made a deliberate decision to optimize for throughput
>at the cost of per-core performance (whereas Intel has not made that choice for mainstream parts).
Yes. Lots of good stuff coming soon.
>
>>At the least, anyone expecting comparisons from 2007-2008
>>to give a guide to results in 2011 is probably going to
>>have some surprises.
>
>Yeah, I'd say that's the biggest take away. A multi-core CPU can take advantage
>of die area just as much as a multi-core device targeted at graphics (GPU). There's
>a difference of about 2-4X due to architecture and optimization points for the vast
>majority of GPGPU workloads. For workloads that can use HW that is unique to the
>GPU, the difference is bigger...but that's an exceptionally rare case.
It's very interesting in this discussion to understand
how Nehalem and Westmere have already evolved the CPU's
cache hierarchy towards supporting much more parallelism.
There must be some extra cost for keeping everything
coherent, but a 6-core Westmere with 6 parallel L1's
and 6 parallel L2's and 3 parallel DDR3 channels is a very
different animal - and much more GPU-like - than a
Pentium4, or even than a quad-core Core2. From that
point of view the convergence of CPU and GPU is already
quite far along in a lot of ways.
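As a rough back-of-the-envelope sketch of that point (the clock, per-core L1 width, and DDR3 speed below are illustrative assumptions for a Westmere-EP-class part, not measured figures):

```python
# Aggregate parallelism of a 6-core Westmere-class CPU, illustratively.
# All numbers are round assumptions, not measurements.

CORES = 6
FREQ_GHZ = 3.0            # assumed core clock
L1_BYTES_PER_CYCLE = 16   # assumed: one 16-byte load per cycle per core

# Six private L1s can all be read in parallel, one per core:
l1_read_gbps = CORES * L1_BYTES_PER_CYCLE * FREQ_GHZ
print(f"aggregate L1 read bandwidth: {l1_read_gbps:.0f} GB/s")

# Three independent DDR3-1333 channels, each 8 bytes wide at 1333 MT/s:
DDR3_CHANNELS = 3
CHANNEL_GBPS = 8 * 1.333  # ~10.7 GB/s per channel
dram_gbps = DDR3_CHANNELS * CHANNEL_GBPS
print(f"aggregate DRAM bandwidth:   {dram_gbps:.1f} GB/s")
```

The point of the arithmetic is just that the bandwidth scales with the number of parallel caches and channels - the same throughput-by-replication structure a GPU uses.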
>
>And of course for non-data parallel workloads, the CPU will probably blow away the GPU by a huge margin.
>
>The one tricky part is that Intel's manufacturing is 12-18 months ahead of TSMC
>and GF...so in some cases an Intel multi-core will have a fair advantage due to power/area improvements.
Yes, but then an x86 has to pay a penalty to keep its caches
coherent, and also to be able to run the BIOS and DOS :-(
And there's the question of whether the GPGPU,
even if it only gives a modest speedup, might achieve
higher performance per watt.
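To make that question concrete (the peak-GFLOPS and TDP figures below are round assumptions for 2010-era parts, purely for illustration):

```python
# Illustrative performance-per-watt comparison. The peak-throughput and
# TDP numbers are assumed round figures, not vendor measurements.

def gflops_per_watt(peak_gflops, tdp_watts):
    return peak_gflops / tdp_watts

cpu = gflops_per_watt(peak_gflops=80.0,  tdp_watts=130.0)   # assumed 6-core x86
gpu = gflops_per_watt(peak_gflops=544.0, tdp_watts=188.0)   # assumed GPGPU (DP)

print(f"CPU: {cpu:.2f} GFLOPS/W, GPU: {gpu:.2f} GFLOPS/W")
```

Even if a real kernel only realizes a modest fraction of the GPU's peak, a perf/watt gap of this size could still leave the GPGPU ahead on efficiency.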
My best guess is that in systems that need GPUs for
graphics - i.e. anything with a screen - they'll also
get used for a few friendly GPGPU apps. But with
high-end x86 evolving so many throughput-enhancing
features, GPGPU will probably end up with only a
small share of HPC.