Article: Parallelism at HotPar 2010
By: Michael S (already5chosen.delete@this.yahoo.com), August 3, 2010 10:23 am
Room: Moderated Discussions
none (none@none.com) on 8/3/10 wrote:
---------------------------
>Richard Cownie (tich@pobox.com) on 8/3/10 wrote:
>---------------------------
>[...]
>>http://arstechnica.com/hardware/reviews/2008/11/nehalem-launch-review.ars/4
>>>
>>
>>... showing cache bandwidth of about 12GB/sec (though
>>maybe you could boost that using multiple threads ?)
>>
>>And you get factor of 1024/12 which is about 85x.
>
>This looks half of what it should be, as I'd expect one
>64-bit read/cycle which would give about 25 GB/s. Multiply
>by the number of cores, let's say 4 or 6, and you reach 100
>or 150 GB/s. That's only 6-10x less than the 1 TB/s figure
>you quoted for ATI, less impressive heh? :)
>
>I might be wrong about the i7 being able to read 64b/cycle
>from its L1 cache but even if we keep the figure of
>12 GB/sec, one has to multiply it by the number of cores, 4
>or 6. Still very far from 100x...
In fact you are wrong in the opposite direction.
Nehalem can simultaneously read and write 16B/cycle from L1 cache, i.e. one 16-byte load plus one 16-byte store per cycle. And you don't even need SSE for that. A simple rep movsd with a big enough n will do it too.
So for 4 cores at 3 GHz you have 4 x 3 x 16 = 192 GB/s for read, or 384 GB/s for copy (counting the read and the write streams together). And in tight computational kernels, what you get in practice quite commonly reaches about 75% of theory.
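To make the arithmetic explicit, here is a rough back-of-the-envelope sketch in Python. The function name and parameters are mine, purely for illustration. The inputs (4 cores, 3 GHz, 16 B/cycle each way, 75% efficiency, and the 1024 GB/s figure quoted for ATI) are the numbers from this thread, not measurements.

```python
def peak_l1_bandwidth_gb_s(cores, clock_ghz, bytes_per_cycle):
    """Theoretical aggregate L1 bandwidth in GB/s.

    GHz * bytes/cycle gives GB/s per core; multiply by the core count.
    Illustrative helper, not taken from any library.
    """
    return cores * clock_ghz * bytes_per_cycle


CORES = 4            # assumed core count from the post
CLOCK_GHZ = 3.0      # assumed clock from the post
LOAD_BYTES = 16      # one 16-byte load per cycle
STORE_BYTES = 16     # one 16-byte store per cycle

read_peak = peak_l1_bandwidth_gb_s(CORES, CLOCK_GHZ, LOAD_BYTES)
copy_peak = peak_l1_bandwidth_gb_s(CORES, CLOCK_GHZ, LOAD_BYTES + STORE_BYTES)

# Rule of thumb from the post: well-tuned kernels reach about 75% of theory.
read_realistic = 0.75 * read_peak
copy_realistic = 0.75 * copy_peak

# Ratio against the 1024 GB/s GPU figure quoted earlier in the thread.
gpu_vs_cpu_read = 1024 / read_peak

print(f"read peak: {read_peak:.0f} GB/s, at 75%: {read_realistic:.0f} GB/s")
print(f"copy peak: {copy_peak:.0f} GB/s, at 75%: {copy_realistic:.0f} GB/s")
print(f"GPU figure / CPU read peak: {gpu_vs_cpu_read:.1f}x")
```

With these assumptions the gap against the quoted GPU figure is roughly 5x for reads, not 85x.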