Article: Parallelism at HotPar 2010
By: Mark Roulo (nothanks.delete@this.xxx.com), August 3, 2010 10:19 am
Room: Moderated Discussions
Richard Cownie (tich@pobox.com) on 8/3/10 wrote:
---------------------------
>It gives you 20 separate L1 texture caches with an
>aggregate bandwidth of 1TB/sec; and L2 caches which
>can supply 435GB/sec.
>
>Compare against this for Nehalem:
>
>http://arstechnica.com/hardware/reviews/2008/11/nehalem-launch-review.ars/4
>>
>
>... showing cache bandwidth of about 12GB/sec (though
>maybe you could boost that using multiple threads ?)
>
>And you get factor of 1024/12 which is about 85x.
The Nehalem B/W is for a single core, though, right? So on a quad-core Nehalem you'd have closer to 50 GB/sec of cache B/W (4 x 12 = 48), which works out to roughly a 20x GPU advantage rather than 85x.
*IF* the B/W on the Nehalem was measured with scalar code (say, 4-byte-wide ints) instead of vector code (16 bytes wide) and the cache *IS* 128 bits wide (which would make sense), then each access could move 4x as much data and we'd see closer to 200 GB/sec of cache B/W for a four-core Nehalem.
~10 GB/sec for a Nehalem core seems suspiciously low to me. Assume a 2.5 GHz clock; then the cache can only move 4 bytes per clock in or out of the registers. Really?
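For anyone who wants to check the arithmetic, here is a rough back-of-envelope sketch in Python. The inputs are just the numbers quoted in this thread. The 2.5 GHz clock, the 4-byte scalar access width, and the 128-bit cache path are my assumptions, not measured values, and the constant and function names are made up for illustration.

```python
# Back-of-envelope check of the bandwidth figures quoted in this thread.
# All inputs are quoted or assumed values, not measurements.
GPU_L1_GB_S = 1024.0        # ~1 TB/sec aggregate L1 texture cache B/W (quoted)
NEHALEM_CORE_GB_S = 12.0    # ~12 GB/sec from the linked review (assumed single core)
CORES = 4                   # quad-core Nehalem
CLOCK_GHZ = 2.5             # assumed clock speed
SCALAR_BYTES = 4            # assumed width of the measured accesses
VECTOR_BYTES = 16           # assumed 128-bit (SSE) access width


def estimates():
    """Return the derived figures as a dict of floats."""
    quad = NEHALEM_CORE_GB_S * CORES
    # If the benchmark used scalar loads, vector loads would move 4x as much.
    vector_quad = quad * VECTOR_BYTES / SCALAR_BYTES
    return {
        "single_core_ratio": GPU_L1_GB_S / NEHALEM_CORE_GB_S,
        "quad_core_gb_s": quad,
        "quad_core_ratio": GPU_L1_GB_S / quad,
        "vector_quad_gb_s": vector_quad,
        "vector_quad_ratio": GPU_L1_GB_S / vector_quad,
        # ~10 GB/sec per core divided by the clock gives bytes moved per cycle.
        "bytes_per_clock": 10.0 / CLOCK_GHZ,
    }


if __name__ == "__main__":
    for name, value in estimates().items():
        print(f"{name}: {value:.1f}")
```

If the vector-code assumption holds, the GPU advantage shrinks to single digits, which is why the measurement method matters so much here.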
-Mark Roulo