Article: Parallelism at HotPar 2010
By: David Kanter (dkanter.delete@this.realworldtech.com), August 3, 2010 12:28 pm
Room: Moderated Discussions
Richard Cownie (tich@pobox.com) on 8/3/10 wrote:
---------------------------
>Mark Roulo (nothanks@xxx.com) on 8/2/10 wrote:
>---------------------------
>
>>The raw compute advantage of an nVidia Fermi GPU vs. a 6-core Intel CPU is in the 10x range.
>>
>>The raw bandwidth advantage is in the 4x to 5x range.
>
>I suspect that the apps which show the really impressive
>speedups are those which can map nicely onto the
>GPU's rather weird memory hierarchy, especially the
>texture caches.
>
>See this slide about Cypress (Radeon 5870) for example:
>
>http://www.techpowerup.com/reviews/AMD/HD_5000_Leaks/images/arch6.jpg
>
>It gives you 20 separate L1 texture caches with an
>aggregate bandwidth of 1TB/sec; and L2 caches which
>can supply 435GB/sec.
>Compare against this for Nehalem:
>
>http://arstechnica.com/hardware/reviews/2008/11/nehalem-launch-review.ars/4
>>
>
>... showing cache bandwidth of about 12GB/sec (though
>maybe you could boost that using multiple threads ?)
That's an excellent point - there is a difference in on-chip bandwidth, but I suspect it's in the same neighborhood as the difference between the memory controllers.
The Nehalem cache bandwidth cited in the Ars piece is certainly wrong. Look here:
http://techreport.com/articles.x/19196/4 (measured Sandra results)
http://portal.acm.org/citation.cfm?id=1669165 (directed latency/bandwidth tests for lines in different cache states)
According to the paper by Hackenberg et al., using a copy test on 8 cores (two Nehalem sockets), they measured 623GB/s, 278GB/s and 76GB/s for the L1, L2 and L3 caches respectively. Cutting that in half yields 311GB/s, 139GB/s and 38GB/s for a 4-core Nehalem. Scaling up by 1.5X for six cores, Westmere should hit around 466GB/s and 208.5GB/s for L1 and L2. I don't know if the L3 cache bandwidth increased by 50% though...
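For flavor, the kind of measurement involved is just a multi-threaded copy loop over a buffer sized to sit in a given cache level. Here's a minimal sketch in C with OpenMP - this is not the benchmark from the Hackenberg paper, and the buffer size, repetition count and compile flags are illustrative assumptions you'd tune per cache level:

/* Minimal sketch of an aggregate cache-copy-bandwidth test (illustrative,
 * not the paper's actual benchmark). Sweep BUF_BYTES to target L1/L2/L3.
 * Compile with something like: gcc -O2 -fopenmp copytest.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

#define BUF_BYTES (16 * 1024)   /* per-thread working set; ~half of a 32KB L1 */
#define REPS      100000        /* enough repetitions for a stable timing */

int main(void)
{
    long total_bytes = 0;
    double t0 = omp_get_wtime();

    #pragma omp parallel reduction(+:total_bytes)
    {
        /* each thread copies between two private, cache-resident buffers */
        char *src = malloc(BUF_BYTES);
        char *dst = malloc(BUF_BYTES);
        memset(src, 1, BUF_BYTES);

        for (int r = 0; r < REPS; r++) {
            src[r & (BUF_BYTES - 1)] ^= 1;   /* keep the copy from being folded away */
            memcpy(dst, src, BUF_BYTES);
            total_bytes += 2L * BUF_BYTES;   /* count both the read and the write */
        }

        volatile char sink = dst[0];         /* keep dst live past the loop */
        (void)sink;
        free(src);
        free(dst);
    }

    double t1 = omp_get_wtime();
    printf("aggregate copy bandwidth: %.1f GB/s\n",
           total_bytes / (t1 - t0) / 1e9);
    return 0;
}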
Cache sizes are radically different though. Westmere has 32KB/core L1, 256KB/core L2 and 2MB/core L3.
So realistically, you're talking about a difference of ~2X for cache bandwidth, just comparing L1 to L1 and L2 to L2: Cypress's 1TB/s of aggregate L1 texture bandwidth vs. ~466GB/s for Westmere's L1s, and 435GB/s vs. ~208GB/s for the L2s. I could believe that in practice the advantage is more like 3X, given the different optimizations in the cache hierarchy, but I don't see well written code getting much higher than that.
>If your app needs huge bandwidth for repeated read-only
>accesses to not-too-big data with very good locality,
>then it may be a good fit. If not, then you're back in
>the 4x-10x range suggested by DRAM bandwidth and GFLOPs
>(and probably a bit below that, because of the various
>problems of the not-so-flexible GPU processors and the
>massive parallelism needed to keep everything busy).
I think the difference is going to be more pronounced for workloads whose working set falls out of cache on the CPU, but still fits within the GPU's on-board memory.
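To put rough numbers on that, here's a back-of-the-envelope sketch (all capacities and bandwidths below are assumed round figures for a 6-core Westmere and a Fermi/Cypress-class board, not measurements): once the working set spills out of the L3 but still fits on the GPU board, you're comparing DRAM bandwidths, which is where the 4x-5x figure quoted above comes from.

/* Back-of-the-envelope: a working set that spills out of the CPU's LLC
 * but fits in the GPU board's DRAM sees the full DRAM-bandwidth gap,
 * not the ~2X cache gap. All figures are assumed, for illustration. */
#include <stdio.h>

int main(void)
{
    const double cpu_llc_mb   = 12.0;    /* assumed: 6 x 2MB L3          */
    const double cpu_dram_gbs = 32.0;    /* assumed: 3-channel DDR3-1333 */
    const double gpu_mem_mb   = 1024.0;  /* assumed: 1GB GDDR5 board     */
    const double gpu_dram_gbs = 150.0;   /* assumed: Cypress-class GDDR5 */

    /* example workload: streaming over 64M single-precision floats */
    double ws_mb = 64.0 * 1024 * 1024 * 4 / (1024 * 1024);

    printf("working set: %.0f MB\n", ws_mb);
    if (ws_mb > cpu_llc_mb && ws_mb < gpu_mem_mb)
        printf("CPU is DRAM-bound, GPU is on-board-memory-bound: ~%.1fx gap\n",
               gpu_dram_gbs / cpu_dram_gbs);
    return 0;
}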
David