By: David Kanter (dkanter.delete@this.realworldtech.com), January 18, 2011 6:10 pm
Room: Moderated Discussions
MS (ms@lostcircuits.com) on 1/18/11 wrote:
---------------------------
>David Kanter (dkanter@realworldtech.com) on 1/18/11 wrote:
>---------------------------
>>MS (ms@lostcircuits.com) on 1/18/11 wrote:
>>---------------------------
>>>David Kanter (dkanter@realworldtech.com) on 1/18/11 wrote:
>>>---------------------------
>>>
>>>
>>>>
>>>>Look at the bandwidth for a 512KB or 1MB data set. That's large enough to spill
>>>>into the L2 cache (128KB or 256KB/core). The respective bandwidth numbers are ~300GB/s
>>>>and 250GB/s for a 3.4GHz part that can hit a peak of 3.8GHz with 4 cores active.
>>>>
>>>>300GB/s --> 75.0GB/s per core --> 19.7-22.1B per core*cycle
>>>>
>>>>250GB/s --> 62.5GB/s per core --> 16.4-18.4B per core*cycle
>>>>
>>>>Both of these numbers suggest that the theoretical maximum must be above 16B/cycle,
>>>>since the test will not hit 100% of peak.
>>>>
>>>
>>>>David
>>>
>>>You cannot fit a 512kB data set into a 256kB cache, the numbers you were looking
>>>at appear to be LLC rather than L2. L2 numbers are everything from 32kB to 256.
>>
>>My understanding is that it's a total of 512KB that is split between the 4 different
>>caches. I don't really place much faith in tools like Sandra in the first place,
>>and they certainly do little to explain the precise nature of the tests.
>>
>>>I ran the benchmark, though on a single thread and I am getting 75GB/sec for both
>>>L1D and L2, which comes out to 21.8 bytes/cycle (at 3.7GHz) so you are probably right about the 32-byte path.
>>
>>Do you have any way to verify that the hits are occurring in the L2 and not L1D?
>>Or are you using a larger data set that is only fully resident in the L2?
>>
>>David
>
>Sandra uses an adaptation of Stream just like everybody else and I am actually
>talking quite often to Adrian regarding some of the benchmarks and for the most
>part they are just as good as any other esoteric bench.
>
>Each core has a 256 kB discrete L2 cache which gives a combined 1MB L2 but they
>are discrete and you cannot span data across them, which is the fundamental difference
>to a shared cache like the L3 or LLC.
I'm aware : )
>If it is a data set, then that is one "coherent data structure" which means that
>if there are discrete caches for the different cores, the data set cannot span across
>core boundaries.
>In other words, the max size that fits into the L2 cache is 256kB
>for each data set. Similarly, any data structure that is larger than 32kB will
>not fit into the L1D but has to go into the L2 cache.
Yes, that's true, but 32KB of your large structure can still sit in the L1 cache. If many of your accesses fall within that 32KB, the bandwidth numbers may be skewed upwards. IOW, you want to be sure that the accesses are missing in the L1 cache and hitting in the L2 cache. Although the L1 read bandwidth is similar to the L2 bandwidth, the L2 latency will affect the achievable bandwidth, as will the number of in-flight misses.
Put another way, there are plenty of ways to access a 256KB data structure such that the majority of reads are serviced from the L1 cache, especially with a clever prefetcher.
I would think that Sandra is designed to avoid such things, but it's frankly very difficult to tell.
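To put rough numbers on the latency and in-flight-miss point, here is a minimal sketch of the usual Little's law bound: sustained bandwidth is at most (outstanding line fills x line size) / latency, capped by the width of the datapath. The function names are made up for illustration, and the 64-byte line, 12-cycle L2 latency and 32B/cycle path are assumed inputs, not measurements.

```python
# Little's law bound on cache bandwidth from miss concurrency and latency.
# Every numeric input here is an illustrative assumption, not a measurement.

LINE_BYTES = 64  # assumed cache line size


def concurrency_bound(in_flight_misses: int, latency_cycles: float) -> float:
    """Bytes/cycle sustainable with this many line fills outstanding."""
    return in_flight_misses * LINE_BYTES / latency_cycles


def achievable(in_flight_misses: int, latency_cycles: float,
               path_bytes_per_cycle: float) -> float:
    """Lower of the concurrency bound and the datapath width."""
    return min(concurrency_bound(in_flight_misses, latency_cycles),
               path_bytes_per_cycle)


if __name__ == "__main__":
    # Assumed: 12-cycle L2 latency, 32B/cycle L2-to-L1 path.
    for misses in (2, 4, 6, 10):
        bw = achievable(misses, latency_cycles=12, path_bytes_per_cycle=32)
        print(f"{misses:2d} misses in flight -> {bw:5.1f} B/cycle")
```

Under those assumptions, a test needs about six line fills in flight before the path width rather than the latency becomes the limit. Anything that keeps fewer misses outstanding will report less than peak L2 bandwidth.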
>Does that answer your question?
Not quite. It sounds like you were using a 256KB data set with a perfectly strided access pattern for your test, is that correct?
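The reason the access pattern matters can be shown with a toy model. This is only a sketch: it assumes a 32KB, 8-way L1D with 64-byte lines and LRU replacement, ignores the prefetcher entirely, and says nothing about how Sandra actually walks memory. The helper names `l1_hit_rate` and `sweep` are hypothetical.

```python
from collections import OrderedDict

LINE = 64                          # assumed bytes per cache line
L1_BYTES = 32 * 1024               # L1D size discussed in the thread
WAYS = 8                           # assumed associativity
SETS = L1_BYTES // (LINE * WAYS)   # 64 sets under these assumptions


def l1_hit_rate(addresses):
    """Fraction of accesses that hit a simple LRU set-associative L1 model."""
    sets = [OrderedDict() for _ in range(SETS)]
    hits = total = 0
    for addr in addresses:
        line = addr // LINE
        ways = sets[line % SETS]
        total += 1
        if line in ways:
            hits += 1
            ways.move_to_end(line)        # mark as most recently used
        else:
            if len(ways) >= WAYS:
                ways.popitem(last=False)  # evict least recently used line
            ways[line] = True
    return hits / total


def sweep(size_bytes, stride_bytes, passes=4):
    """Repeated linear walk over a buffer with a fixed stride."""
    for _ in range(passes):
        yield from range(0, size_bytes, stride_bytes)


if __name__ == "__main__":
    kb = 1024
    print("256KB, 64B stride:", l1_hit_rate(sweep(256 * kb, 64)))
    print("256KB,  8B stride:", l1_hit_rate(sweep(256 * kb, 8)))
```

In this model a 64-byte stride over 256KB never hits in the L1, so every read is serviced by the L2. An 8-byte stride over the same buffer hits in the L1 on 7 of every 8 reads (87.5%), even though the data set is eight times the size of the L1.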
DK



