By: Michael S (already5chosen.delete@this.yahoo.com), December 21, 2017 9:00 am
Room: Moderated Discussions
Travis (travis.downs.delete@this.gmail.com) on December 21, 2017 8:35 am wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on December 21, 2017 1:52 am wrote:
> > Michael S (already5chosen.delete@this.yahoo.com) on December 20, 2017 2:53 pm wrote:
> > > Bridges? Wells?
> >
> > Yes, Bridges and Wells show the same behavior.
> > 13.5 cycles per iteration on Ivy Bridge (i7-3770)
> > 14 cycles per iteration on Haswell (E3-1271 v3)
>
> Huh - those are the values for the loop with 2 stores?
Yes.
> What do you get for 1 store?
5.5 or 6.
> How
> many iterations are you averaging over?
100M iterations total. So, for 64K buffers, I run the measurement ~100K times.
> If you print out the results for shorter loops
> I wonder if you see the bimodal behavior: i.e., if 13.5 is the average between (longish)
> periods of ~9 cycles and periods of ~18 cycles (that's what I see on Skylake).
I am not *that* interested.
>
> >
> > So my first thought (associativity conflict, due to SKL L2 having fewer ways than L1D) proves wrong.
> >
> > However, I still think that for some reason approximately 30-50% of the stores that
> > should be going to L2 end up in main memory, and most of the rest go to LLC.
> >
> > Unlike me, you like to read performance counters. What do they say?
>
> They show that the loops are bottlenecked on store buffer entries, which makes sense: the stores are committing
> slowly so the bottleneck as observed by the core will always be "SB full". It doesn't tell us much.
>
> The counters dealing with hits and misses mostly tell the expected story: the expected number
> of references to L2 and essentially no requests to L3 or DRAM, so it does not appear anything is
> going to those higher levels (and indeed the results are probably "too fast" for that).
No, not too fast.
On my Haswell I see that the LLC is capable of storing a cache line approximately every 7.5 clocks.
But if the counters say that it is not happening, then I have to believe the counters.
>
> You also find that while there are many fine-grained and interesting counters for all sorts of load
> stalls, stores definitely get the short end of the stick, so in general you don't have much
> visibility into stores. At the L2 level there are l2_rqsts.all_rfo, l2_rqsts.rfo_miss and l2_rqsts.rfo_hit,
> but these apparently aren't triggered by stores to lines already modified in L2, so they are close
> to zero (maybe they trigger only when the state had to be changed from something else to M in L2,
> or maybe they are counting RFO requests from other cores that probe this L2). You only see the stores
> indirectly through l2_rqsts.references, and it has the expected number.
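For anyone wanting to reproduce these counter readings, something like the following perf invocation should cover the events mentioned, plus the store-buffer stall counter. This is a sketch: event names are as exposed on Haswell/Skylake and vary by kernel and PMU (check `perf list`), and `./store_loop` is a placeholder for the actual benchmark binary.

```shell
# L2 store-side request counters plus store-buffer-full stall cycles.
# Event availability and spelling vary by CPU generation; verify with
# `perf list` before relying on any of these.
perf stat \
  -e l2_rqsts.references,l2_rqsts.all_rfo,l2_rqsts.rfo_hit,l2_rqsts.rfo_miss \
  -e resource_stalls.sb \
  ./store_loop
```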