Why does writing to non-sequential lines in L2 perform so poorly?

By: John D. McCalpin (john.delete@this.mccalpin.com), June 18, 2020 12:50 pm
Room: Moderated Discussions
Travis Downs (travis.downs.delete@this.gmail.com) on June 13, 2020 3:18 pm wrote:
>
> One open question: I thought Intel chips used RFO prefetch, where upcoming entries in the
> store buffer (other than the one at head) are examined to start fetching the line early -
> but if that existed we wouldn't see such poor performance here. Perhaps it just doesn't trigger
> in this case. Maybe RFO prefetch was a figment of my imagination or a patent I read, and most
> of the MLP actually comes from opening multiple fill buffers as described above.

L2 RFO prefetches to LLC and to L2 definitely exist -- it is less clear how often they are generated....

In Chapter 3 of the Xeon Scalable Processor Family Uncore Performance Monitoring Reference Manual (document 336274-001, July 2017), Table 3-1 contains the opcodes available for matching at the CHA box. Opcode 0x258 is LlcPrefRFO, described as an "LLC Prefetch RFO". This opcode is also used in four derived metrics described in section 2.2.9 -- all of which have descriptions that match what one would expect for an LLC RFO Prefetch. One reason that I am cautious about how often they might be generated is that the default configuration for SKX/CLX processors disables HW PF to L3.

Chapter 3 does not list any IDI opcode that one might call an L2 RFO PF. Due to the more "possessive" nature of RFOs, I would guess that an RFO generated by an L2 HW prefetcher would use the demand RFO transaction type. Testing this will require determining what patterns generate HW RFO prefetches and then fiddling with the patterns until I can control how many "spurious" RFO prefetches are generated (i.e., those not followed up by a demand RFO).

In Volume 3 of the Intel Software Developer's Manual (document 325384-071, October 2019), Table 18-43 in Section 18.3.8.2 shows "request types" for the "offcore response" performance counter event in the core. This table allows you to select "DEMAND_RFO", "PF_L2_RFO", and "PF_L3_RFO". These OFFCORE_RESPONSE events are recorded in the L2, so it makes sense that they can track PF_L2_RFO independently of DEMAND_RFO, even if both use the same opcode when sending requests to the CHA/LLC (0x250 is the demand RFO opcode).
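
A minimal sketch of how such a request-type selection might be composed into an MSR_OFFCORE_RSP_x value. The bit positions below are assumptions recalled from the SKX request-type table and should be checked against Table 18-43 before use; the "any response" bit and the helper name `offcore_rsp_value` are likewise hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Request-type bits (ASSUMED positions; verify against Table 18-43). */
enum {
    REQ_DEMAND_RFO    = 1u << 1,   /* demand RFO                      */
    REQ_PF_L2_RFO     = 1u << 5,   /* L2 HW prefetcher RFO            */
    REQ_PF_L3_RFO     = 1u << 8,   /* L3 (LLC) HW prefetcher RFO      */
    REQ_PF_L1D_AND_SW = 1u << 10   /* L1D HW prefetch and SW prefetch */
};

/* ASSUMED: bit 16 of the MSR selects "any response". */
#define RSP_ANY_RESPONSE (1ull << 16)

/* Hypothetical helper: combine request-type bits (low 16 bits) with
 * response-type bits (bit 16 and above) into one MSR value. */
uint64_t offcore_rsp_value(uint32_t request_bits, uint64_t response_bits)
{
    assert((request_bits & ~0xFFFFu) == 0);    /* requests live in bits 0-15 */
    assert((response_bits & 0xFFFFull) == 0);  /* responses start at bit 16  */
    return (uint64_t)request_bits | response_bits;
}
```

Programming one counter with only `REQ_PF_L2_RFO` and another with only `REQ_DEMAND_RFO` would, under these assumptions, separate the two request types at the core even if they share an opcode at the CHA.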

Interestingly, in the OFFCORE_RESPONSE event request type field, there is a single bit for "PF_L1D_AND_SW" -- L1 Data Cache Hardware Prefetches and Software Prefetches. The description of the L1 HW prefetchers in the Intel Optimization Reference Manual (document 248966-042b, September 2019), section 2.5.5.4, seems very clear that these are only for loads, while the L2 HW prefetchers can respond to sequences of loads or stores. It will be interesting to see if PREFETCHW (prefetch with intent to write) is counted like a read or like an RFO.

Table 3-7 in the SKX Uncore manual shows that UPI includes both "InvItoE" and "InvItoM" transactions. Both are "coherence-only" transactions (i.e., the core already has a valid copy of the line), but the former requests an upgrade to E state, while the latter requests an upgrade directly to M state. A PREFETCHW instruction could be implemented either way.

(In the olden days, MIPS was missing a prefetch instruction that gave the normal E-state read behavior -- you could either prefetch shared (in which case you had to do an upgrade before writing) or you could prefetch modified (in which case the data had to be written back even if not modified). This made it difficult to use the same code for different array sizes....)
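
A test kernel for the PREFETCHW question above might look roughly like the following sketch. It stores to every Nth cache line and issues a write-intent prefetch a few lines ahead using the GCC/Clang builtin `__builtin_prefetch` with its write hint set. The function name, the 64-byte line size, and the parameters are assumptions for illustration, and whether the compiler actually emits PREFETCHW for the write hint depends on the target flags, so the generated assembly should be inspected before drawing conclusions from counter values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64  /* ASSUMED cache-line size */

/* Hypothetical kernel: store one byte to every `stride_lines`-th cache
 * line of `buf`, prefetching with write intent `ahead` strides in front
 * of the current store. Returns the number of lines written. */
size_t store_with_write_prefetch(uint8_t *buf, size_t bytes,
                                 size_t stride_lines, size_t ahead)
{
    assert(stride_lines > 0);
    size_t step = stride_lines * LINE_BYTES;
    size_t written = 0;

    for (size_t off = 0; off < bytes; off += step) {
        size_t pf = off + ahead * step;
        if (pf < bytes) {
            /* rw = 1 requests a prefetch for write; locality = 3 */
            __builtin_prefetch(buf + pf, 1, 3);
        }
        buf[off] = 1;  /* the demand store that follows the prefetch */
        written++;
    }
    return written;
}
```

Running this with and without the prefetch while counting DEMAND_RFO, PF_L2_RFO, and PF_L1D_AND_SW separately would be one way to see which request type the write-intent prefetch is attributed to.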
