Undocumented L1D cache event

By: Travis Downs (travis.downs.delete@this.gmail.com), May 1, 2019 4:06 pm
Room: Moderated Discussions
Recent Intel chips provide a couple of interesting events at event code 0x48: l1d_pend_miss.pending (umask 0x1), and l1d_pend_miss.fb_full (umask 0x2).

The former counts the number of outstanding misses for cache lines *required by a demand load* each cycle. With cmask=1 you instead get the total number of cycles in which at least one such miss is outstanding; Intel gives this a separate event name: l1d_pend_miss.pending_cycles.
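
Given those two definitions, dividing the first count by the second should give the average number of demand-load misses in flight during cycles when at least one is in flight. A minimal sketch of that arithmetic follows; the function name and the counter readings are made up for illustration and are not measurements:

```python
def average_outstanding_misses(pending: int, pending_cycles: int) -> float:
    """Mean number of demand-load misses in flight, taken over the
    cycles in which at least one such miss was outstanding.

    pending        -- l1d_pend_miss.pending (sum of outstanding misses per cycle)
    pending_cycles -- l1d_pend_miss.pending_cycles (same event with cmask=1)
    """
    if pending_cycles == 0:
        # No cycle had a miss outstanding, so there is nothing to average.
        return 0.0
    return pending / pending_cycles


# Hypothetical readings, chosen only to make the arithmetic obvious.
print(average_outstanding_misses(pending=8_000_000, pending_cycles=1_000_000))
```

A result close to the number of fill buffers would suggest the load stream is keeping the miss path busy; a result near 1 would suggest the misses are mostly serialized.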

The latter (l1d_pend_miss.fb_full) counts the number of times execution of a memory instruction was blocked due to no fill buffers available. In Intel's words:

Number of times a request needed a FB entry but there was no entry
available for it. That is the FB unavailability was dominant reason
for blocking the request. A request includes cacheable/uncacheable
demands that is load, store or SW prefetch.

This one is a bit harder to characterize because it depends on how often memory access uops "retry" when they initially encounter fill buffer exhaustion. Empirically it counts 1-2 per cycle when there are loads waiting for fill buffers, but much less for stores (about 1 count every 5 cycles for a long stream of stores that miss to RAM).

Those preliminaries out of the way, one might ask whether the other umask values (0x4 and 0x8) hide any interesting undocumented events. I tried both: 0x8 was a bust, as it always reports 0. umask=0x4 is more interesting: it reports counts that correlate roughly with L1D misses across various uarch-bench benchmarks.

After a bit more faffing around I've decided that event=0x48 umask=0x4 counts *fill buffer allocations*. That is, every time a fill buffer is allocated, for whatever reason, this increments by 1.
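
If you want to try this yourself, the event has no name in the tools, so it has to be given by its raw encoding. Here is a small sketch that builds both forms Linux perf accepts. The bit layout is the standard Intel event-select one (event in bits 0-7, umask in bits 8-15, cmask in bits 24-31); the helper names and the "fb_alloc" label are my own inventions, not anything Intel or perf defines:

```python
def raw_config(event: int, umask: int, cmask: int = 0) -> int:
    """Pack the fields as in Intel's event-select register layout:
    event select in bits 0-7, umask in bits 8-15, cmask in bits 24-31."""
    for field, value in (("event", event), ("umask", umask), ("cmask", cmask)):
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{field} must fit in 8 bits, got {value:#x}")
    return event | (umask << 8) | (cmask << 24)


def perf_event(event: int, umask: int, cmask: int = 0, name: str = "") -> str:
    """Format an event for perf's cpu PMU syntax (cpu/event=...,umask=.../)."""
    parts = [f"event={event:#x}", f"umask={umask:#x}"]
    if cmask:
        parts.append(f"cmask={cmask:#x}")
    if name:
        parts.append(f"name={name}")
    return "cpu/" + ",".join(parts) + "/"


# The undocumented event; "fb_alloc" is only a hypothetical label.
print(perf_event(0x48, 0x4, name="fb_alloc"))
print(f"r{raw_config(0x48, 0x4):x}")
# The documented pending_cycles variant, for comparison (umask 0x1, cmask 1).
print(perf_event(0x48, 0x1, cmask=1))
```

Either string can be passed to perf stat with -e, next to the documented events, to compare the counts on your own benchmark.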

This can be a useful event. It counts something different than any other event, although it is often highly correlated with other events. Some notes:

  • Like the other 0x48 events it counts at the L1D, so it is not affected by things that occur at the L2, such as L2 prefetches.

  • It seems to count L1 writebacks separately, so a load or store that misses and evicts a dirty line counts as 2 (one for the miss and one for the writeback). It isn't clear to me if L1-to-L2 writeback buffers are a separate pool, or use the line fill buffers, but either way this event seems to count them. This is one of the only ways to count L1 writebacks, which are mostly invisible in the performance counters.

  • It is somewhat similar to mem_load_retired.l1_miss, in that every load l1 miss will cause a FB allocation, but that event only counts misses that originate from loads. It ignores misses from stores and software or hardware prefetches. In particular, if a software/hardware prefetch starts bringing in a line (L1 miss) and then a load instruction subsequently hits that line (even immediately, before the line arrives) the l1_miss event doesn't trigger. The new event helps you understand fill buffer allocation from all sources.

  • It is also similar to the l2_rqsts.references counter, since most l2 requests originate from l1 misses. However the L2 event also includes L2 prefetches (which can be a lot, sometimes more than all other L2 requests combined), and doesn't count L2 accesses due to L1 writebacks.

  • It counts split loads (loads that cross a cache line boundary) as expected: two fill buffers are allocated. I mention this because I found that the mem_load_retired events *don't* count split loads as you might expect: unlike non-split loads, even if all your split loads miss, you get essentially zero mem_load_retired.l1_miss counts: your loads all show up as mem_load_retired.fb_hit or mem_load_retired.l1_hit. Probably what happens is the split load executes, it is detected as a split load (and the two line loads are triggered) and then the uop is replayed (perhaps split into two uops since I see 2x as many uops going to p2 and p3) and at that point you get a FB hit since the load is already in progress. All that is a long way of saying that the l1_miss event doesn't give a reliable picture of misses in the presence of split loads - but the fill buffer allocation event isn't fooled. I presume the same applies to split stores but I didn't test.

  • Like the other 0x48 events, this isn't tied to instruction retirement or execution, so I expect it to count speculatively as well, i.e., allocations that are on a mis-speculated path will still be counted (although I didn't test this).

  • I tried this event on Skylake-S and Skylake-SP and the behavior was the same across all the platforms, so it is "generally available" across most recent Intel stuff.

  • I also tried it on Cannonlake, but it counts something totally different there - still correlated with L1 misses but nothing like fill buffer allocations. Generally the value was about 50-150% of the l1d_pend_miss.fb_full event. BTW CNL still counts "always 0" for umask=0x8. The whole L1 miss path seems to have been redesigned on CNL, so it's possible this event got changed due to that. Intel hasn't released any performance counter events for CNL at all as far as I can tell, but most Skylake events still work. This one will remain a mystery for another day.

I can't be sure the event is actually counting fill buffer allocations, but at least all the numbers I've seen are consistent with that interpretation.
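
One way to use the notes above is to subtract mem_load_retired.l1_miss from the new event and look at what is left. Under my interpretation, the remainder lumps together stores, software and hardware L1 prefetches, L1 writebacks, split loads and anything allocated on a mis-speculated path, so treat it as a rough indicator only. The function name and the sample counts below are hypothetical:

```python
from typing import Tuple


def non_load_allocations(fb_alloc: int, load_l1_miss: int) -> Tuple[int, float]:
    """Return (count, share) of fill buffer allocations that are NOT
    explained by retired load misses.

    fb_alloc     -- event=0x48,umask=0x4 (assumed to count FB allocations)
    load_l1_miss -- mem_load_retired.l1_miss
    """
    other = fb_alloc - load_l1_miss
    share = other / fb_alloc if fb_alloc else 0.0
    return other, share


# Hypothetical counts, not measured data.
other, share = non_load_allocations(fb_alloc=1_000_000, load_l1_miss=600_000)
print(f"{other} allocations ({share:.0%}) not explained by retired load misses")
```

A large share on a load-only benchmark would point at prefetches, writebacks or split loads rather than at the loads themselves.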

Maybe you will find a use for this event one day! Of course, as an undocumented event it comes with no warranty: it may damage your CPU or send bolts of lightning out of the USB port to fry your cat as punishment for using it, but so far so good for my use cases.