By: MS (ms.delete@this.lostcircuits.com), January 18, 2011 8:11 pm
Room: Moderated Discussions
David Kanter (dkanter@realworldtech.com) on 1/18/11 wrote:
---------------------------
>MS (ms@lostcircuits.com) on 1/18/11 wrote:
>---------------------------
>>David Kanter (dkanter@realworldtech.com) on 1/18/11 wrote:
>>---------------------------
>>>MS (ms@lostcircuits.com) on 1/18/11 wrote:
>>>---------------------------
>>>>David Kanter (dkanter@realworldtech.com) on 1/18/11 wrote:
>>>>---------------------------
>>>>
>>>>
>>>>>
>>>>>Look at the bandwidth for a 512KB or 1MB data set. That's large enough to spill
>>>>>into the L2 cache (128KB or 256KB/core). The respective bandwidth numbers are ~300GB/s
>>>>>and 250GB/s for a 3.4GHz part that can hit a peak of 3.8GHz with 4 cores active.
>>>>>
>>>>>300GB/s --> 75.0GB/s per core --> 19.7-22.1B per core*cycle
>>>>>
>>>>>250GB/s --> 62.5GB/s per core --> 16.4-18.4B per core*cycle
>>>>>
>>>>>Both of these numbers suggest that the theoretical maximum must be above 16B/cycle,
>>>>>since the test will not hit 100% of peak.
>>>>>
>>>>
>>>>>David
>>>>
>>>>You cannot fit a 512kB data set into a 256kB cache, the numbers you were looking
>>>>at appear to be LLC rather than L2. L2 numbers are everything from 32kB to 256.
>>>
>>>My understanding is that it's a total of 512KB that is split between the 4 different
>>>caches. I don't really place much faith in tools like Sandra in the first place,
>>>and they certainly do little to explain the precise nature of the tests.
>>>
>>>>I ran the benchmark, though on a single thread and I am getting 75GB/sec for both
>>>>L1D and L2, which comes out to 21.8 bytes/cycle (at 3.7GHz) so you are probably right about the 32-byte path.
>>>
>>>Do you have any way to verify that the hits are occurring in the L2 and not L1D?
>>>Or are you using a larger data set that is only fully resident in the L2?
>>>
>>>David
>>
>>Sandra uses an adaptation of Stream just like everybody else and I am actually
>>talking quite often to Adrian regarding some of the benchmarks and for the most
>>part they are just as good as any other esoteric bench.
>>
>>Each core has a 256 kB discrete L2 cache which gives a combined 1MB L2 but they
>>are discrete and you cannot span data across them, which is the fundamental difference
>>to a shared cache like the L3 or LLC.
>
>I'm aware : )
>
>>If it is a data set, then that is one "coherent data structure" which means that
>>if there are discrete caches for the different cores, the data set cannot span across
>>core boundaries.
>>In other words, the max size that fits into the L2 cache is 256kB
>>for each data set. Similarly, any data structure that is larger than 32kB will
>>not fit into the L1D but has to go into the L2 cache.
>
>Yes, that's true, but you can fit 32KB of your large structure into the L1 cache.
>If many of your accesses fall within the 32KB in the L1 cache, the bandwidth numbers
>may be skewed upwards. IOW, you want to be sure that the accesses are missing in
>the L1 cache and hitting in the L2 cache. Although the L1 read bandwidth is similar
>to the L2 bandwidth, the latency will have an impact on achievable bandwidth as will the number of in-flight misses.
>
>Put another way - there are plenty of ways to access a 256KB data structure in
>such a way that the majority of reads are serviced from the L1 cache. Especially with a clever prefetcher.
>
>I would think that Sandra is designed to avoid such things, but it's frankly very difficult to tell.
>
>>Does that answer your question?
>
>Not quite. It sounds like you were using a 256KB data set with a perfectly strided
>access pattern for your test, is that correct?
>
>DK
As I mentioned, Sandra uses STREAM, which has a linear access pattern, so yes.
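
For anyone who wants to re-check the bytes-per-cycle figures quoted in this thread, here is a minimal sketch of the arithmetic. The function name and the binary_units switch are my own, for illustration only; nothing here comes from Sandra. The binary_units option is an assumption: treating the single-thread 75GB/sec as GiB/s (2^30 bytes) is one reading that reproduces the 21.8 bytes/cycle at 3.7GHz, whereas decimal GB gives about 20.3.

```python
def bytes_per_cycle(bandwidth_gb_s, clock_ghz, cores=1, binary_units=False):
    """Per-core bytes per clock cycle from an aggregate bandwidth figure.

    binary_units=True treats "GB" as GiB (2**30 bytes). Which unit the
    benchmark actually reports is an assumption, not a known fact.
    """
    unit = 2**30 if binary_units else 10**9
    per_core_bytes_per_s = bandwidth_gb_s * unit / cores
    return per_core_bytes_per_s / (clock_ghz * 1e9)


# Aggregate numbers from the thread, 4 cores, clock between 3.4 and 3.8GHz.
for total in (300, 250):
    low = bytes_per_cycle(total, 3.8, cores=4)   # highest clock -> fewest B/cycle
    high = bytes_per_cycle(total, 3.4, cores=4)  # lowest clock -> most B/cycle
    print(f"{total} GB/s: {low:.1f}-{high:.1f} B per core*cycle")

# Single-thread result at 3.7GHz, under both unit assumptions.
print(f"decimal GB: {bytes_per_cycle(75, 3.7):.1f} B/cycle")
print(f"binary GiB: {bytes_per_cycle(75, 3.7, binary_units=True):.1f} B/cycle")
```

Under either unit assumption the single-thread figure is above 16 bytes/cycle, so the conclusion about a path wider than 16 bytes does not depend on which one is used.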



