By: Travis (travis.downs.delete@this.gmail.com), December 21, 2017 8:35 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on December 21, 2017 1:52 am wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on December 20, 2017 2:53 pm wrote:
> > Bridges? Wells?
>
> Yes, Bridges and Wells show the same behavior.
> 13.5 cycles per iteration on Ivy Bridge (i7-3770)
> 14 cycles per iteration on Haswell (E3-1271 v3)
Huh - those are the values for the loop with 2 stores? What do you get for 1 store? How many iterations are you averaging over? If you print out the results for shorter loops I wonder if you see the bimodal behavior: i.e., if 13.5 is the average between (longish) periods of ~9 cycles and periods of ~18 cycles (that's what I see on Skylake).
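To make the bimodal question concrete, here is a minimal sketch of the check I have in mind. The per-window numbers are made up for illustration (they are not measurements), and `split_modes` is just a hypothetical helper name: the point is only that a 50/50 mix of ~9-cycle and ~18-cycle periods averages out to exactly 13.5, so the overall mean alone can't distinguish that from a steady 13.5.

```python
from statistics import mean

def split_modes(samples, threshold):
    """Partition per-window cycle counts into a fast and a slow group."""
    fast = [s for s in samples if s < threshold]
    slow = [s for s in samples if s >= threshold]
    return fast, slow

# Hypothetical cycles-per-iteration values, one per short measurement window.
# Made up for illustration: longish runs of ~9 alternating with runs of ~18.
windows = [9.0] * 4 + [18.0] * 4 + [9.0] * 4 + [18.0] * 4

overall = mean(windows)
fast, slow = split_modes(windows, threshold=overall)

print("overall mean:", overall)
print("fast windows:", len(fast), "mean", mean(fast))
print("slow windows:", len(slow), "mean", mean(slow))
```

If the real per-window values cluster near the overall mean instead of splitting into two groups like this, then the behavior isn't bimodal and something else is going on.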
>
> So my first thought (associativity conflict, due to SKL L2 having fewer ways than L1D) proves wrong.
>
> However I still think that for some reason approximately 30-50% of the stores that
> should be going to L2 end up in main memory and most of the rest goes to LLC.
>
> Unlike me, you like to read performance counters. What do they say?
They show that the loops are bottlenecked on store buffer entries, which makes sense: the stores are committing slowly so the bottleneck as observed by the core will always be "SB full". It doesn't tell us much.
The counters dealing with hits and misses mostly tell the expected story: the expected number of references to L2 and essentially no requests to L3 or DRAM, so it does not appear that anything is going to those higher levels (and indeed the results are probably "too fast" for that).
You also find that while there are many fine-grained and interesting counters for all sorts of load stalls, stores definitely get the short end of the stick, so in general you don't have very much visibility into stores. At the L2 level there are l2_rqsts.all_rfo, l2_rqsts.rfo_miss and l2_rqsts.rfo_hit, but these apparently aren't triggered by stores to lines already modified in L2, so they are close to zero (maybe they trigger only when the state has to be changed from something else to M in L2, or maybe they are counting RFO requests from other cores that probe this L2). You only see the stores indirectly through l2_rqsts.references, and it has the expected number.
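For anyone who wants to repeat the comparison, here is a rough sketch of how the RFO counters can be set against l2_rqsts.references. It assumes the counts were collected with `perf stat -x,` (CSV output, where the first field is the count and the third is the event name; the exact layout may differ between perf versions). The sample text and its numbers are placeholders I made up, not my actual results, and `parse_perf_csv` is a hypothetical helper.

```python
def parse_perf_csv(text):
    """Parse `perf stat -x,` style lines into {event: count}.

    Assumes fields are: count, unit, event name, ... (the rest is ignored).
    Lines whose count is not a plain integer (e.g. "<not counted>") are skipped.
    """
    counts = {}
    for line in text.strip().splitlines():
        fields = line.split(",")
        if len(fields) < 3:
            continue
        value, event = fields[0], fields[2]
        if value.isdigit():
            counts[event] = int(value)
    return counts

# Placeholder output with made-up numbers, for illustration only.
sample = """
1000000,,l2_rqsts.references,2000000,100.00,,
40,,l2_rqsts.all_rfo,2000000,100.00,,
25,,l2_rqsts.rfo_hit,2000000,100.00,,
15,,l2_rqsts.rfo_miss,2000000,100.00,,
"""

counts = parse_perf_csv(sample)
rfo_share = counts["l2_rqsts.all_rfo"] / counts["l2_rqsts.references"]
print(f"RFO share of L2 references: {rfo_share:.6f}")
```

A share near zero, as in this made-up sample, is the pattern I described: the stores show up in the reference count but not in the RFO events.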