Article: Parallelism at HotPar 2010
By: Richard Cownie (tich.delete@this.pobox.com), August 3, 2010 9:27 am
Room: Moderated Discussions
Mark Roulo (nothanks@xxx.com) on 8/2/10 wrote:
---------------------------
>The raw compute advantage of an nVidia Fermi GPU vs. a 6-core Intel CPU is in the 10x range.
>
>The raw bandwidth advantage is in the 4x to 5x range.
---------------------------
I suspect that the apps which show the really impressive
speedups are those which can map nicely onto the
GPU's rather weird memory hierarchy, especially the
texture caches.
See this slide about Cypress (Radeon 5870) for example:
http://www.techpowerup.com/reviews/AMD/HD_5000_Leaks/images/arch6.jpg
It gives you 20 separate L1 texture caches with an
aggregate bandwidth of 1 TB/sec, and L2 caches which
can supply 435 GB/sec.
Compare against this for Nehalem:
http://arstechnica.com/hardware/reviews/2008/11/nehalem-launch-review.ars/4
... showing cache bandwidth of about 12 GB/sec (though
maybe you could boost that using multiple threads?)
And you get a factor of 1024/12, which is about 85x.
The graph also shows cache bandwidth for an older Core 2
CPU as about 9 GB/sec, which would give a factor of 114x.
So you can see how you get a 100x figure, with slightly
half-assed code on the cpu side, using an older cpu,
and exploiting that huge aggregate bandwidth of texture
caches on the GPU side.
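To put rough numbers on that, here's a back-of-the-envelope
sketch in Python, just restating the figures quoted above
(treating 1 TB/sec as 1024 GB/sec; these are the quoted
numbers, not measurements of mine):

# Ratios from the figures quoted above (taken from the
# linked Cypress slide and the Ars Technica graph).
gpu_l1_aggregate_gbs = 1024.0  # Cypress: 20 L1 texture caches combined
gpu_l2_gbs           = 435.0   # Cypress: L2 texture caches
nehalem_cache_gbs    = 12.0    # Nehalem cache bandwidth (Ars graph)
core2_cache_gbs      = 9.0     # older Core 2 (same graph)

print(gpu_l1_aggregate_gbs / nehalem_cache_gbs)  # ~85x
print(gpu_l1_aggregate_gbs / core2_cache_gbs)    # ~114x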
If your app needs huge bandwidth for repeated read-only
accesses to not-too-big data with very good locality,
then it may be a good fit. If not, then you're back in
the 4x-10x range suggested by DRAM bandwidth and GFLOPs
(and probably a bit below that, because of the various
problems of the not-so-flexible GPU processors and the
massive parallelism needed to keep everything busy).
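For what it's worth, here is a toy sketch (Python/numpy,
CPU-side only, with hypothetical sizes; meant to illustrate
the access pattern, not to be actual GPU code) of the kind
of workload I mean: millions of read-only lookups into a
small table with very good 2D locality, which is exactly
what texture caches are built to feed:

import numpy as np

# Hypothetical workload: a small 256x256 read-only "texture"
# that fits comfortably in cache, sampled a million times at
# positions that drift slowly, i.e. very good 2D locality.
tex = np.random.rand(256, 256).astype(np.float32)

n = 1_000_000
u = (np.linspace(0.0, 255.0, n) + np.sin(np.arange(n) * 0.01)) % 255.0
v = (np.linspace(0.0, 255.0, n) + np.cos(np.arange(n) * 0.01)) % 255.0

# Nearest-neighbour lookups (a real texture unit would also do
# the filtering); the table is read over and over, never written.
samples = tex[v.astype(np.intp), u.astype(np.intp)]
print(samples.mean())

On a GPU, that kind of inner loop is what would be fed from
the texture caches at something like the 1 TB/sec aggregate
figure, instead of going out to DRAM.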
This is all guesswork, as I've never programmed a GPU.