By: MS (ms.delete@this.lostcircuits.com), January 19, 2011 2:32 pm
Room: Moderated Discussions
MS (ms@lostcircuits.com) on 1/18/11 wrote:
---------------------------
>David Kanter (dkanter@realworldtech.com) on 1/18/11 wrote:
>---------------------------
>>MS (ms@lostcircuits.com) on 1/18/11 wrote:
>>---------------------------
>>>David Kanter (dkanter@realworldtech.com) on 1/18/11 wrote:
>>>---------------------------
>>>>MS (ms@lostcircuits.com) on 1/18/11 wrote:
>>>>---------------------------
>>>>>David Kanter (dkanter@realworldtech.com) on 1/18/11 wrote:
>>>>>---------------------------
>>>>>
>>>>>
>>>>>>
>>>>>>Look at the bandwidth for a 512KB or 1MB data set. That's large enough to spill
>>>>>>into the L2 cache (128KB or 256KB/core). The respective bandwidth numbers are ~300GB/s
>>>>>>and 250GB/s for a 3.4GHz part that can hit a peak of 3.8GHz with 4 cores active.
>>>>>>
>>>>>>300GB/s --> 75.0GB/s per core --> 19.7-22.1B per core*cycle
>>>>>>
>>>>>>250GB/s --> 62.5GB/s per core --> 16.4-18.4B per core*cycle
>>>>>>
>>>>>>Both of these numbers suggest that the theoretical maximum must be above 16B/cycle,
>>>>>>since the test will not hit 100% of peak.
>>>>>>
>>>>>
>>>>>>David
>>>>>
>>>>>You cannot fit a 512kB data set into a 256kB cache; the numbers you were looking
>>>>>at appear to be LLC rather than L2. L2 numbers are everything from 32kB to 256kB.
>>>>
>>>>My understanding is that it's a total of 512KB that is split between the 4 different
>>>>caches. I don't really place much faith in tools like Sandra in the first place,
>>>>and they certainly do little to explain the precise nature of the tests.
>>>>
>>>>>I ran the benchmark, though on a single thread, and I am getting 75GB/sec for both
>>>>>L1D and L2, which comes out to 21.8 bytes/cycle (at 3.7GHz) so you are probably right about the 32-byte path.
>>>>
>>>>Do you have any way to verify that the hits are occurring in the L2 and not L1D?
>>>>Or are you using a larger data set that is only fully resident in the L2?
>>>>
>>>>David
>>>
>>>Sandra uses an adaptation of Stream just like everybody else and I am actually
>>>talking quite often to Adrian regarding some of the benchmarks and for the most
>>>part they are just as good as any other esoteric bench.
>>>
>>>Each core has a 256 kB discrete L2 cache which gives a combined 1MB L2 but they
>>>are discrete and you cannot span data across them, which is the fundamental difference
>>>to a shared cache like the L3 or LLC.
>>
>>I'm aware : )
>>
>>>If it is a data set, then that is one "coherent data structure" which means that
>>>if there are discrete caches for the different cores, the data set cannot span across
>>>core boundaries.
>>>In other words, the max size that fits into the L2 cache is 256kB
>>>for each data set. Similarly, any data structure that is larger than 32kB will
>>>not fit into the L1D but has to go into the L2 cache.
>>
>>Yes, that's true, but you can fit 32KB of your large structure into the L1 cache.
>>If many of your accesses fall within the 32KB in the L1 cache, the bandwidth numbers
>>may be skewed upwards. IOW, you want to be sure that the accesses are missing in
>>the L1 cache and hitting in the L2 cache. Although the L1 read bandwidth is similar
>>to the L2 bandwidth, the latency will have an impact on achievable bandwidth as will the number of in-flight misses.
>>
>>Put another way - there are plenty of ways to access a 256KB data structure in
>>such a way that the majority of reads are serviced from the L1 cache. Especially with a clever prefetcher.
>>
>>I would think that Sandra is designed to avoid such things, but it's frankly very difficult to tell.
>>
>>>Does that answer your question?
>>
>>Not quite. It sounds like you were using a 256KB data set with a perfectly strided
>>access pattern for your test, is that correct?
>>
>>DK
>
>As I mentioned, Sandra uses STREAM, which is a linear access pattern, so yes.
>
I checked, and you were indeed correct about the size of the data blocks in Sandra. In the default configuration the size is given as the aggregate block size, i.e. test block x number of threads. When you disable multithreading and hyperthreading, the size refers to the individual data set.
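
For anyone who wants to redo the arithmetic from this thread, here is a rough sketch. It assumes the aggregate block size is split evenly across threads, that 1GB/s means 10^9 bytes per second, and that the per-core cache sizes are the 32kB L1D and 256kB L2 mentioned above. The function names are made up for illustration and have nothing to do with how Sandra works internally.

```python
# Per-core cache sizes as quoted in the thread, in kB (assumed, not measured).
L1D_KB = 32
L2_KB = 256


def per_thread_block_kb(aggregate_kb, threads):
    """Block size seen by one thread if the aggregate is split evenly."""
    return aggregate_kb / threads


def cache_level(block_kb):
    """Smallest cache level that could hold the whole per-thread block."""
    if block_kb <= L1D_KB:
        return "L1D"
    if block_kb <= L2_KB:
        return "L2"
    return "L3/LLC or memory"


def bytes_per_core_cycle(total_gb_per_s, cores, clock_ghz):
    """Aggregate bandwidth converted to bytes per core per clock cycle."""
    return total_gb_per_s / cores / clock_ghz


if __name__ == "__main__":
    # Aggregate sizes and bandwidths quoted in the thread, 4 threads on 4 cores.
    for aggregate_kb, gb_per_s in ((512, 300), (1024, 250)):
        block = per_thread_block_kb(aggregate_kb, 4)
        low = bytes_per_core_cycle(gb_per_s, 4, 3.8)   # at the 3.8GHz turbo clock
        high = bytes_per_core_cycle(gb_per_s, 4, 3.4)  # at the 3.4GHz base clock
        print(f"{aggregate_kb}kB aggregate -> {block:.0f}kB per thread "
              f"({cache_level(block)}), {low:.1f}-{high:.1f} B per core*cycle")
```

Note that this only says which cache the block could fit into. It does not show where the hits actually occur, which is the open question about L1D versus L2.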



