By: David Kanter (dkanter.delete@this.realworldtech.com), January 18, 2011 3:58 pm
Room: Moderated Discussions
MS (ms@lostcircuits.com) on 1/18/11 wrote:
---------------------------
>David Kanter (dkanter@realworldtech.com) on 1/18/11 wrote:
>---------------------------
>
>
>>
>>Look at the bandwidth for a 512KB or 1MB data set. That's large enough to spill
>>into the L2 cache (128KB or 256KB/core). The respective bandwidth numbers are ~300GB/s
>>and 250GB/s for a 3.4GHz part that can hit a peak of 3.8GHz with 4 cores active.
>>
>>300GB/s --> 75.0GB/s per core --> 19.7-22.1B per core*cycle
>>
>>250GB/s --> 62.5GB/s per core --> 16.4-18.4B per core*cycle
>>
>>Both of these numbers suggest that the theoretical maximum must be above 16B/cycle,
>>since the test will not hit 100% of peak.
>>
>
>>David
>
>You cannot fit a 512kB data set into a 256kB cache, the numbers you were looking
>at appear to be LLC rather than L2. L2 numbers are everything from 32kB to 256.
My understanding is that the 512KB is the total data set, split between the 4 different caches, so each core would only be working on 128KB. I don't really place much faith in tools like Sandra in the first place, and they certainly do little to explain the precise nature of the tests.
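To make the per-core split concrete, here is a rough sketch of the arithmetic. It assumes the benchmark divides the data set evenly across cores, which is only my understanding of the tool and not something it documents; the function names are made up for illustration, and the 32KB and 256KB cache sizes are the ones mentioned in this thread.

```python
def per_core_working_set_kb(total_kb, cores):
    """Per-core share of the data set, ASSUMING an even split across cores."""
    return total_kb / cores


def fits_in_cache(total_kb, cores, cache_kb_per_core):
    """True if each core's share fits in a private cache of the given size."""
    return per_core_working_set_kb(total_kb, cores) <= cache_kb_per_core


if __name__ == "__main__":
    # Sizes taken from the thread: 512KB data set, 4 cores,
    # 32KB L1D and 256KB L2 per core.
    share = per_core_working_set_kb(512, 4)
    print("per-core share (KB):", share)
    print("fits in 32KB L1D:", fits_in_cache(512, 4, 32))
    print("fits in 256KB L2:", fits_in_cache(512, 4, 256))
```

If the split really is even, a 512KB test is too big for each L1D but comfortably inside each L2, which is why I read those numbers as L2 rather than LLC.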
>I ran the benchmark, though on a single thread and I am getting 75GB/sec for both
>L1D and L2, which comes out to 21.8 bytes/cycle (at 3.7GHz) so you are probably right about the 32-byte path.
Do you have any way to verify that the hits are occurring in the L2 and not L1D? Or are you using a larger data set that is only fully resident in the L2?
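For reference, the bytes/cycle conversion used in this thread can be sketched as below. This assumes decimal units (1GB/s = 10^9 bytes/s) and that the core holds a single fixed clock for the whole run; neither is confirmed by the tool, and the helper name is just for illustration.

```python
def bytes_per_cycle(gb_per_s, ghz, cores=1):
    """Convert aggregate bandwidth to bytes per core per cycle.

    ASSUMES decimal GB/s and a constant clock; the 10^9 factors in
    GB/s and GHz cancel, so this is a plain ratio.
    """
    return gb_per_s / cores / ghz


if __name__ == "__main__":
    # Aggregate figures from the thread, spread over 4 cores, evaluated
    # at the 3.4GHz base clock and the 3.8GHz peak clock.
    for total in (300, 250):
        for clock in (3.8, 3.4):
            value = bytes_per_cycle(total, clock, cores=4)
            print(f"{total}GB/s at {clock}GHz: {value:.1f} B/cycle per core")

    # Single-thread figure quoted above: 75GB/s at 3.7GHz.
    print(f"75GB/s at 3.7GHz: {bytes_per_cycle(75, 3.7):.1f} B/cycle")
```

With decimal units, 75GB/s at 3.7GHz works out to roughly 20.3 B/cycle rather than 21.8. The 21.8 figure would be consistent with the tool reporting binary gigabytes, but that is only a guess on my part. Either way the result is above 16B/cycle.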
David