Why does writing to non-sequential lines in L2 perform so poorly?

By: Travis (travis.downs.delete@this.gmail.com), December 20, 2017 2:44 pm
Room: Moderated Discussions
I am looking at an unexpected performance issue on Skylake.

Consider the following loop, which writes one DWORD every 64 bytes:


.top:
mov DWORD PTR [rdx],eax
add rdx,0x40
sub rdi,0x1
jne .top


If the region touched by the stores is 64 KiB (fits in L2, not in L1), this runs at about 3 to 3.5 cycles per store in steady state. The better figure is obtainable if you disable the L2 stream prefetcher, which competes for the L2; the non-steady-state case could perhaps be better if L1 is mostly not dirty. That seems reasonable when you consider the cycles needed to evict a line from L1 and bring in a line from L2 (perhaps one cycle each) and to do the actual store (perhaps one cycle).
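For anyone who wants to try this from C rather than assembly, here is a minimal sketch of the same access pattern. The function name and the WORDS_PER_LINE constant are made up for illustration, the volatile pointer is only there to stop the compiler from merging or dropping the stores, and there is no guarantee a given compiler emits exactly the loop shown above. Timing code is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_LINE 16u  /* 64-byte line / 4-byte DWORD */

/* Hypothetical helper: one 32-bit store at the start of each 64-byte line. */
void store_one_per_line(uint32_t *buf, size_t lines, uint32_t value)
{
    assert(buf != NULL);
    volatile uint32_t *p = buf;      /* volatile: keep every store */
    for (size_t i = 0; i < lines; i++) {
        *p = value;                  /* mov DWORD PTR [rdx],eax */
        p += WORDS_PER_LINE;         /* add rdx,0x40 */
    }
}
```

For the 64 KiB case, buf would hold 16384 words and lines would be 1024, with the call repeated many times so that the measurement reflects steady state.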

If you add another store to the same line:


.top:
mov DWORD PTR [rdx],eax
mov DWORD PTR [rdx + 4],eax
add rdx,0x40
sub rdi,0x1
jne .top


Nothing much changes: it takes 0.5 cycles longer on average - although it is sometimes just as fast and sometimes 1 cycle longer (there are peaks at all 3 values, but 0.5 longer occurs about 90% of the time). If you add further stores to the same line, each additional store takes about 1 cycle longer.

All this makes sense: we know stores are limited on x86 to one per cycle if they hit in L1, and we expect stores to L2 to take longer due to the lower bandwidth and probably limited ports (1 shared r/w port?). The first additional store usually taking only 0.5 cycles extra is probably because some of its work is overlapped with the other store that misses.
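The same-line figures can be folded into a rough back-of-the-envelope model. This is only a sketch of my reading of the numbers quoted here (about 3 cycles for one store with the prefetcher off, about +0.5 for the second store, about +1 for each further one); the function name is hypothetical and the model says nothing about the variance between runs.

```c
#include <assert.h>

/* Hypothetical model: approximate cycles per iteration for n stores
   to the same L2-resident line, using the figures quoted in the post. */
double approx_cycles_per_iteration(int stores_per_line)
{
    assert(stores_per_line >= 1);
    const double base = 3.0;   /* assumed: one store per line, prefetcher off */
    if (stores_per_line == 1)
        return base;
    /* second store: about +0.5; each further store: about +1 */
    return base + 0.5 + 1.0 * (stores_per_line - 2);
}
```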

So if you change that second store to be to a fixed location always in L1, what would you expect? An extra cycle at most?


.top:
mov DWORD PTR [rdx],eax
mov DWORD PTR [rsp - 8],eax
add rdx,0x40
sub rdi,0x1
jne .top


This runs at either 9 or 18 cycles per iteration (alternating weirdly between the two modes: often switching back and forth every few seconds, sometimes running for a minute or more in the fast or slow mode).

That means the extra store to L1 takes a net of 6-15 cycles (9 or 18 cycles per iteration, minus the roughly 3-cycle baseline)!
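The loop with the extra store to a fixed location looks roughly like this in C. It is a sketch under assumptions: the names are invented, the fixed pointer stands in for the [rsp - 8] stack slot, and volatile is used only to keep both stores in the loop. Whether a compiler's output shows the same 9 or 18 cycle behavior is not something this sketch establishes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_LINE 16u  /* 64-byte line / 4-byte DWORD */

/* Hypothetical helper: one store to a new line, one store to a fixed slot. */
void store_line_and_fixed(uint32_t *buf, size_t lines, uint32_t value,
                          uint32_t *fixed)
{
    assert(buf != NULL && fixed != NULL);
    volatile uint32_t *p = buf;
    volatile uint32_t *f = fixed;
    for (size_t i = 0; i < lines; i++) {
        *p = value;              /* new line each time: L1 miss, L2 hit */
        *f = value;              /* same address each time: expected L1 hit */
        p += WORDS_PER_LINE;     /* advance 64 bytes */
    }
}
```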

The problem only occurs with stores: if you do a dummy load or any type of prefetch to bring the line targeted by the first store (the one that misses) into L1 first, the problem goes away and both stores execute in a total of 3-3.5 cycles (i.e., the second store is almost "free", as expected).
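One way to express the dummy-load variant in C is sketched below. The names are hypothetical, and placing the load immediately before the store is my assumption about how to arrange it; a software prefetch issued further ahead would be another option.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_LINE 16u  /* 64-byte line / 4-byte DWORD */

/* Hypothetical helper: touch each line with a load before storing to it. */
void load_then_store(uint32_t *buf, size_t lines, uint32_t value,
                     uint32_t *fixed)
{
    assert(buf != NULL && fixed != NULL);
    volatile uint32_t *p = buf;
    volatile uint32_t *f = fixed;
    for (size_t i = 0; i < lines; i++) {
        uint32_t old = *p;       /* dummy load: requests the line into L1 */
        (void)old;               /* value is intentionally unused */
        *p = value;              /* store to the line just loaded */
        *f = value;              /* store to the fixed location */
        p += WORDS_PER_LINE;     /* advance 64 bytes */
    }
}
```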

What feature of the L1<->L2 path on x86 could cause this? It isn't clear to me, for example, exactly how the L2 -> L1 transfers and L1 evictions interact with the line fill buffers and store buffer. When is a store miss "noticed"? It's clear that a load pretty much allocates an LFB when the load executes (after the address is ready) and the L1 probe misses, and the performance behavior more or less makes sense based on what we know about LFB numbers and so on. When does an LFB get allocated for a store miss though? In the same way, or later when the store is getting ready to commit from the store buffer?

My particular issue is on x86, but I'm also curious about insights into this on other platforms (e.g., does the total store ordering of x86 make this tougher to implement quickly?).