Why does writing to non-sequential lines in L2 perform so poorly?

By: Travis Downs (travis.downs.delete@this.gmail.com), June 13, 2020 3:18 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on December 22, 2017 1:16 pm wrote:

> You're right. Intel does document writeback L1, so my theory
> of "store buffer always drains to L2" must be garbage.
>
> So clearly the store buffer must drain to L1 if it exists.
>
> That doesn't invalidate the other part of the theory: the very act of the store
> buffer draining to two different cache levels might end up requiring the store
> buffer to be synchronized because of the store ordering visibility guarantees.
>
> But yeah, it's a pretty weak argument.
>
> So I'm probably wrong. I don't see what else would make
> the cache level switching matter for timing, though.
>
> Linus

I looked a bit more at this and I think you were basically right about this at least at the high level: it is an ordering problem. I don't think the stores drain directly to L2 though: that's not necessary to make this work.

Rather, it's a consequence of keeping the observability point (GO) of stores in order, combined with an optimization for stores that miss in certain matters that can remain ordered in the fill buffers.

A store to L1 hit becomes GO the moment it commits to L1 (the line is already in E/M state by definition). A store that misses in L1 allocates a fill buffer, and drains into that buffer. When the line returns, written bytes in the fill buffer are merged with the incoming data and committed to L1: this is the GO point for a miss (so really, it's the same as a hit: the moment the store gets committed to L1).

As an optimization on current Intel, immediately subsequent stores that miss to that same line can also drain into the same open fill buffer. Furthermore, subsequent consecutive stores that miss to a different line get their own fill buffer and can drain into that one. This is handy because it allows the store buffer to keep draining when there are many stores misses to consecutive lines (linear writes being obviously a fairly common pattern).

Critically, this only works as long as the store stream is to one line at a time, never returning back to the same line: AAAABBBBCDE is fine, because you have your A, B, C, D, E fill buffers, and the fill buffer order is consistent with the store order. If you have ABAB, however, there is no ordering of A and B fill buffers that will be consistent with that order: as soon as you commit the A buffer, for example, the second A store will appear to occur before the first B store. So it is not safe to merge the second A into the fill buffer created for the B store.

A similar thing occurs if "B" is not a miss to a different cache line, but an L1 hit to the same cache line: it's not safe to commit B to L1, because then it will become GO before the first A. So B has to stall. That's why the pattern ABCDEF is efficient: you can open 6 fill buffers for all six misses, while AXBXCXDXEXF, where X is an L1 hit, is not: you can only open one buffer at a time because the interleaved hits fence off the later misses to preserve ordering.

This is very clear with a new test I wrote which just strides though L2, writing every 32 bytes in two different patterns:


AABBCCDD pattern:

mov DWORD PTR [rdx],eax
mov DWORD PTR [rdx+0x20],eax
mov DWORD PTR [rdx+0x40],eax
mov DWORD PTR [rdx+0x60],eax
add rdx,0x80

ABABCDCD pattern:

mov DWORD PTR [rdx],eax
mov DWORD PTR [rdx+0x40],eax
mov DWORD PTR [rdx+0x20],eax
mov DWORD PTR [rdx+0x60],eax
add rdx,0x80


Those are identical except for the order of the two inner stores. They make the same number of writes, to the same location, touching the same two lines (per iteration), but the first write at offsets 0, 32, 64, 96, while the second changes the order to 0, 64, 32, 96.

The second runs roughly four times slower than the first: ~3 cycles vs ~12 cycles per store. 12 cycles is almost exactly the L2 latency, but I think that's a coincidence: the MLP factor here should be 2, not 1: since AB can go down in parallel (remember the stall happens at the second A in ABA), then when those come back the second AB can commit (directly to L1 this time), then CD go in parallel, etc.

So I think that's most of the mystery solved.

The above test writes to an L2-sized (64 KiB) buffer. If you make it a 1 MiB sized buffer (fits easily in L3, not in L2), performance doesn't get much worse for the ABAB case: it runs in ~14 cycles per store (6 cycles per store for AABB). That's prefetching kicking in at the L2: almost all of the additional latency to L3 is hidden by prefetching. If I disable prefetching, it's ~24 cycles per line, makes is consistent with our theory: that's the L3 latency (roughly 36 cycles on this box), divided by 2 (MLP factor 2), plus ~6 cycles for the overhead of committing the rest of the stores, opening the buffers, etc.

One open question: I thought Intel chips used RFO prefetch, where upcoming entries in the store buffer (other than the one at head) are examined to start fetching the line early - but if that existed we wouldn't see such poor performance here. Perhaps it just doesn't trigger in this case. Maybe RFO prefetch was a figment of my imagination or a patent I read, and most of the MLP actually comes from opening multiple fill buffers as described above.


< Previous Post in ThreadNext Post in Thread >
TopicPosted ByDate
Why does writing to non-sequential lines in L2 perform so poorly?Travis2017/12/20 02:44 PM
  Bridges? Wells? (NT)Micahel S2017/12/20 03:53 PM
    Bridges? Wells? (NT)Travis2017/12/20 04:46 PM
      That should say "huh"? (NT)Travis2017/12/20 04:46 PM
        That should say "huh"?Jeff S.2017/12/20 05:11 PM
          That should say "huh"?Travis2017/12/20 06:34 PM
    Bridges? Wells?Jeff S.2017/12/20 05:17 PM
      Bridges? Wells?Travis2017/12/20 06:37 PM
    Bridges, Wells - positiveMichael S2017/12/21 02:52 AM
      Bridges, Wells - positiveTravis2017/12/21 09:35 AM
        Bridges, Wells - positiveMichael S2017/12/21 10:00 AM
  Why does writing to non-sequential lines in L2 perform so poorly?Linus Torvalds2017/12/20 06:18 PM
    Why does writing to non-sequential lines in L2 perform so poorly?Travis2017/12/20 06:54 PM
      Why does writing to non-sequential lines in L2 perform so poorly?Linus Torvalds2017/12/21 12:12 PM
        Why does writing to non-sequential lines in L2 perform so poorly?anon2017/12/22 03:29 AM
          Why does writing to non-sequential lines in L2 perform so poorly?Linus Torvalds2017/12/22 01:16 PM
            Why does writing to non-sequential lines in L2 perform so poorly?Travis2017/12/23 08:48 PM
            Why does writing to non-sequential lines in L2 perform so poorly?Travis Downs2020/06/13 03:18 PM
              Why does writing to non-sequential lines in L2 perform so poorly?John D. McCalpin2020/06/18 12:50 PM
                Why does writing to non-sequential lines in L2 perform so poorly?Travis Downs2020/06/18 05:32 PM
                  Why does writing to non-sequential lines in L2 perform so poorly?Travis Downs2020/06/18 05:34 PM
    Why does writing to non-sequential lines in L2 perform so poorly?anon.12017/12/21 06:09 PM
      Why does writing to non-sequential lines in L2 perform so poorly?Linus Torvalds2017/12/22 01:20 PM
        Why does writing to non-sequential lines in L2 perform so poorly?Travis2017/12/24 02:09 PM
  Why does writing to non-sequential lines in L2 perform so poorly?Travis2017/12/20 08:52 PM
    Why does writing to non-sequential lines in L2 perform so poorly?Adrian2017/12/21 12:09 AM
      Why does writing to non-sequential lines in L2 perform so poorly?Travis2017/12/21 09:23 AM
    Why does writing to non-sequential lines in L2 perform so poorly?-.-2017/12/27 03:53 AM
      Why does writing to non-sequential lines in L2 perform so poorly?-.-2017/12/27 03:53 AM
        Why does writing to non-sequential lines in L2 perform so poorly?Travis2017/12/27 04:18 PM
  Why does writing to non-sequential lines in L2 perform so poorly?Etienne2017/12/21 02:36 AM
    Why does writing to non-sequential lines in L2 perform so poorly?Michael S2017/12/21 02:58 AM
      Why does writing to non-sequential lines in L2 perform so poorly?Travis2017/12/21 09:26 AM
        Michael ignore my last question - saw your other reply (NT)Travis2017/12/21 09:27 AM
  Why does writing to non-sequential lines in L2 perform so poorly?Nksingg2017/12/26 06:47 AM
    Why does writing to non-sequential lines in L2 perform so poorly?David Kanter2017/12/26 11:48 AM
    Why does writing to non-sequential lines in L2 perform so poorly?Travis2017/12/27 04:33 PM
  Cannot reproduce with microcode 0xc6Travis Downs2019/02/26 04:23 PM
    Cannot reproduce with microcode 0xc6Adrian2019/02/26 09:35 PM
    Cannot reproduce with microcode 0xc6Adrian2019/02/26 10:07 PM
    Cannot reproduce with microcode 0xc6Adrian2019/02/27 05:02 AM
      Cannot reproduce with microcode 0xc6Travis Downs2019/02/27 08:25 AM
        Cannot reproduce with microcode 0xc6Adrian2019/02/28 01:16 AM
          Cannot reproduce with microcode 0xc6Travis Downs2019/03/07 06:51 PM
        Cannot reproduce with microcode 0xc6Adrian2019/02/28 09:54 AM
          Cannot reproduce with microcode 0xc6Travis Downs2019/03/24 06:34 PM
    Cannot reproduce with microcode 0xc6Travis Downs2019/02/27 03:20 PM
Reply to this Topic
Name:
Email:
Topic:
Body: No Text
How do you spell tangerine? 🍊