Why does writing to non-sequential lines in L2 perform so poorly?

By: Travis Downs (travis.downs.delete@this.gmail.com), June 18, 2020 5:32 pm
Room: Moderated Discussions
John D. McCalpin (john.delete@this.mccalpin.com) on June 18, 2020 12:50 pm wrote:
> Travis Downs (travis.downs.delete@this.gmail.com) on June 13, 2020 3:18 pm wrote:
> >
> > One open question: I thought Intel chips used RFO prefetch, where upcoming entries in the
> > store buffer (other than the one at head) are examined to start fetching the line early -
> > but if that existed we wouldn't see such poor performance here. Perhaps it just doesn't trigger
> > in this case. Maybe RFO prefetch was a figment of my imagination or a patent I read, and most
> > of the MLP actually comes from opening multiple fill buffers as described above.
>
> L2 RFO prefetches to LLC and to L2 definitely exist -- it is less clear how often they are generated....

Sorry, I sent you off on a bit of a wild goose chase. I perhaps used the wrong term, but when I said "RFO prefetch" what I really meant is a core-level prefetcher near the store buffer/L1D interface that looks ahead into the store buffer and fetches lines that aren't in the L1D before the corresponding store gets to the head of the queue.

These are not like other types of prefetches, which are inherently speculative (they don't know for sure that the line will be accessed): in this case the core knows the line will be needed (at least as long as it only looks at the retired/senior part of the store buffer). The only risk is that a line fetched ahead of time will be invalidated by another core and lost before it is time to store to it (or the line may be evicted from L1D by subsequently incoming lines in the same set, but that seems unlikely).

Intel patents describe exactly such a mechanism, along with a predictor that turns it off when the line is frequently lost before it can be stored to.

I was calling this "RFO prefetch", but let's call it "store buffer lookahead" (SBH) instead.

A benchmark of stores to random locations runs much faster, per store, than the memory latency. For example, I get about 7 ns per store to a random location in a large region, vs a memory latency of about 55 ns, which implies a high degree of MLP. I thought this SBH was the mechanism for getting this MLP: the lookahead allows the RFOs for several stores to be in progress at once, something that an "only the head store at a time" design wouldn't allow. I made it part of my mental model for the store path.
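To put a rough number on "a high degree of MLP": by Little's law, the average number of misses in flight is about the miss latency divided by the time per store. The sketch below just does that division with the two figures quoted above. It assumes every store misses all the way to memory and that the ~55 ns latency figure also applies to RFOs, neither of which I have verified; the function name is made up for illustration.

```python
def implied_mlp(latency_ns: float, time_per_store_ns: float) -> float:
    """Average number of misses in flight implied by Little's law.

    Assumes every store is a miss with the given latency (a simplification).
    """
    return latency_ns / time_per_store_ns


# Figures from the post: ~55 ns memory latency, ~7 ns per random store.
mlp = implied_mlp(55.0, 7.0)
print(f"implied MLP ~ {mlp:.1f}")  # prints "implied MLP ~ 7.9"
```

So something on the store path has to keep roughly eight line requests outstanding on average, whatever the mechanism turns out to be.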

Now, as discussed in my previous post, I think that stores also allocate and drain into fill buffers when they miss, and there can be several such buffers for several store misses, if you avoid the ABAB problem. This provides another mechanism to get the observed MLP. That's what I mean by "I'm not sure about RFO prefetch". Sorry for any confusion.
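Either mechanism amounts to "more than one line request in flight while stores still commit in order", so a toy model cannot tell them apart, but it does show how strongly the per-store time depends on the number of outstanding requests. The sketch below is a deliberately crude model, not a description of the hardware: every store misses to a distinct line, a request slot is held until its store commits, and the slot count of 8 is an arbitrary illustrative value picked only because it is close to 55/7.

```python
def drain_time_ns(n_stores: int, latency_ns: float, max_outstanding: int) -> float:
    """Time to commit n_stores in order when each store misses to its own line.

    At most max_outstanding line requests are in flight at once; a slot is
    reused only after the store that held it has committed (toy assumption).
    """
    commit = []  # commit[i] = time at which store i commits
    for i in range(n_stores):
        # The request for store i starts when store (i - max_outstanding) commits.
        start = commit[i - max_outstanding] if i >= max_outstanding else 0.0
        arrive = start + latency_ns
        prev = commit[i - 1] if i > 0 else 0.0
        commit.append(max(arrive, prev))  # stores commit in program order
    return commit[-1]


n = 8000
head_only = drain_time_ns(n, 55.0, 1) / n   # one request at a time
overlapped = drain_time_ns(n, 55.0, 8) / n  # 8 slots: hypothetical value
print(head_only, overlapped)                # prints "55.0 6.875"
```

With a single slot the model gives the full latency per store; with 8 slots it lands close to the ~7 ns I measured, which is consistent with either store buffer lookahead or multiple fill buffers doing the overlapping.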


The other type of store prefetching that you describe definitely exists. You get giant performance differences when you disable the L2 streamer for load-only workloads. If I had to guess, it is the same L2 streamer as used for loads, but it just notes whether the requests are RFOs or not, so it can issue the right type of request "downstream".

> clear that these are only for loads, while the L2 HW prefetchers can respond to sequences of loads or stores.
> It will be interesting to see if PREFETCHW (prefetch with intent to write) is counted like a read or like
> an RFO. Table 3-7 in the SKX Uncore manual shows that UPI includes both "InvItoE" and "InvItoM" transactions.
> Both are "coherence-only" transactions (i.e., the core already has a valid copy of the line), but the former
> requests an upgrade to E state, while the latter requests an upgrade directly to M state. A PREFETCHW instruction
> could be implemented either way. (In the olden days, MIPS was missing a prefetch instruction that gave the
> normal E-state read behavior -- you could either prefetch shared (in which case you had to do an upgrade before
> writing) or you could prefetch modified (in which case the data had to be written back even if not modified).
> This made it difficult to use the same code for different array sizes....)

At least for the L2_RQSTS counter, I find that prefetchw instructions are grouped with normal demand RFOs. I.e., a SW prefetchw is not distinguishable from a demand store RFO. This is in contrast to other SW prefetch instructions, which *do* get their own umask bits. That's undocumented and based on my own tests, so it applies most directly to Skylake-S, and I'm not totally sure it applies to SKX with its different L2, but I guess yes.