Article: Parallelism at HotPar 2010
By: Richard Cownie (tich.delete@this.pobox.com), August 3, 2010 10:15 am
Room: Moderated Discussions
none (none@none.com) on 8/3/10 wrote:
---------------------------
>
>This looks half of what it should be, as I'd expect one
>64-bit read/cycle which would give about 25 GB/s. Multiply
>by the number of cores, let's say 4 or 6, and you reach 100
>or 150 GB/s. That's only 6-10x less than the 1 TB/s figure
>you quoted for ATI, less impressive heh? :)
I think you're correct that the number should be a bit
higher: I saw another benchmark that showed about 40 GB/s.
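For reference, here's the back-of-envelope arithmetic as a
runnable sketch (the ~3.2GHz clock is my assumption - the
post above doesn't state one):

    #include <stdio.h>

    int main(void) {
        double ghz = 3.2;            /* assumed clock, GHz */
        /* One 64-bit (8-byte) read per cycle -> GB/s per core. */
        double per_core = 8.0 * ghz; /* ~25.6 GB/s */
        printf("per core: %.1f GB/s\n", per_core);
        for (int cores = 4; cores <= 6; cores += 2)
            printf("%d cores: %.0f GB/s\n", cores, cores * per_core);
        return 0;
    }

which gives ~25.6 GB/s per core and ~102-154 GB/s for 4-6
cores, matching your numbers.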
However, your data structures aren't necessarily going
to fit in the CPU caches - especially if you haven't taken
a lot of care optimizing them. And if they don't fit in
the rather small L1 caches, then you're not going to get
a 4x or 6x speedup from using multiple cores, because
everything will go through the shared L2 cache. [Hmm,
I guess Nehalem-EX has some fairly fancy multiple cache
slices + ring bus to make this better - but that's probably
not what the benchmarks in the literature are comparing
against.]
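The drop-off is easy to see with a crude streaming
microbenchmark - here's a single-threaded sketch (the sizes
and the traffic budget are arbitrary placeholders, and a
serious version would be more careful about prefetchers and
vectorization):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Repeatedly sum a working set of 'bytes'; once it spills out
       of L1 (and then L2), the apparent GB/s should drop in steps. */
    static double gbps(size_t bytes) {
        size_t n = bytes / sizeof(double);
        double *a = malloc(bytes);
        for (size_t i = 0; i < n; i++) a[i] = 1.0;

        /* Aim for ~256MB of total traffic regardless of size. */
        int reps = (int)((size_t)(1 << 28) / bytes);
        if (reps < 1) reps = 1;

        volatile double sum = 0.0;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++) sum += a[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        free(a);
        return (double)bytes * reps / secs / 1e9;
    }

    int main(void) {
        size_t sizes[] = { 16 << 10, 256 << 10, 4 << 20, 64 << 20 };
        for (int i = 0; i < 4; i++)
            printf("%7zu KB: %6.1f GB/s\n",
                   sizes[i] >> 10, gbps(sizes[i]));
        return 0;
    }

Run the same loop from 4 or 6 threads and the small working
sets scale almost linearly, while the big ones fight over
the shared cache and memory bus.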
Anyway, taking a higher-level view of this, what the CPU
has is a small number of L1 caches and a single shared
L2 cache with roughly the same bandwidth as a single L1.
What a Radeon 5870 has is 20 small L1 caches, each
with 50 GB/s of bandwidth. And the GPU memory hierarchy
is optimized throughout for bandwidth rather than latency.
Roughly, I'd expect you to get 2x or 3x from optimizing
for bandwidth (e.g. a high-latency cache with a very
wide datapath), and then in the best case a 20x factor
from having 20 caches rather than a single L2. So that
could explain a 60x factor. And it's really easy to blow
2x in software details ...
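Putting those numbers together (the 20 caches at 50 GB/s
are from above; the ~50 GB/s for the CPU's shared L2 is my
assumption, following the "roughly the same bandwidth as a
single L1" hand-wave):

    #include <stdio.h>

    int main(void) {
        double gpu = 20 * 50.0;  /* Radeon 5870: 20 L1s x 50 GB/s = ~1 TB/s */
        double cpu = 50.0;       /* assumed shared-L2 bandwidth, GB/s       */

        /* 20x from the cache count alone ... */
        printf("cache-count factor: %.0fx\n", gpu / cpu);

        /* ... times 2x-3x for a bandwidth-optimized design. */
        printf("best case: %.0fx-%.0fx\n",
               2 * gpu / cpu, 3 * gpu / cpu);
        return 0;
    }

i.e. 20x from the cache count and 2x-3x from the design
point gets you to 40x-60x, before software throws away its
own factor of 2.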
Anyway, I'm not intending to knock CPUs, which I think are
just great these days [and getting better all the time,
as with the Nehalem-EX cache + ring-bus design]. But if
you're looking for an architectural difference that can
explain a very large factor in performance, then the
cache hierarchy seems the most likely explanation.