Why does writing to non-sequential lines in L2 perform so poorly?

By: Travis (travis.downs.delete@this.gmail.com), December 23, 2017 8:48 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on December 22, 2017 12:16 pm wrote:
> anon (anon.delete@this.ymous.net) on December 22, 2017 2:29 am wrote:
> > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on December 21, 2017 11:12 am wrote:
> > >
> > > So my thinking is that the behavior you see might be because
> > >
> > > (1) the store buffer drains purely to the L2, and the L2 is the real cache coherency
> > > boundary for external cores. The store ordering is easy to maintain because the stores
> > > really drain in order (although fetching the L2 lines can obviously be entirely OoO).
> >
> > When you say "real cache coherency boundary", you still agree that lines in L1D still need to be
> > invalidated, so either L2 is inclusive of L1D or invalidations will also go to L1D, correct?
>
> Obviously the L1 needs to be invalidated at some point, it just doesn't need to be involved
> in the cache coherency decisions because lower level caches are inclusive (I think for modern
> Intel cores it's L3 that is inclusive, not L2, but afaik that has changed over time).
>
> So as far as external cores are concerned, any L1 write ordering is entirely invisible.

Yes, on modern Intel the L3 is inclusive, so it can act as a snoop filter for snoops from other cores - but the L1 isn't isolated from cache coherency concerns: if an RFO snoop from one core "hits" a line held by another core, the invalidate has to go all the way up to the L1 of that other core. So the private L1 ordering does become visible to other cores in that way.

That really seems to limit the ability to commit stores out of order, even to the L1 - unless you have some kind of mechanism where, when a snoop comes in for a locally owned L1 line, you drain any older stores from the store buffer before responding. (Of course, committing stores only to the L1 is cheap, so OoO commit doesn't help much there - there has to be a mix of L1 hits and misses for it to be interesting.) Most likely they just keep it simple and commit stores in order.

So I think it is likely there are restrictions on store commit order even without invoking HT.

> But HT can see the L1 write ordering. Unless there is some per-core
> exclusion bit in the L1 tags, which there might well be.

Even if each logical core commits in order to the L1 (as above), HT still causes problems: because the core's private L1 is shared by both siblings, a logical core might see writes by its sibling out of order with respect to the global order (in the same way that store buffers mess things up, but that case is allowed in the memory model). To solve this, stores from one logical core have to snoop the load buffer of the sibling logical core, and if they hit, the sibling incurs a machine clear. These machine clears are visible using perf.

That latter case is too bad: normally you'd expect communication between sibling logical cores to be very fast, since it only has to pass through L1, opening up some interesting possibilities (e.g., prefetch threads, application-specific profiling, etc.) - but the machine clear makes this expensive unless you can coordinate so that the loads and stores are separated in time.
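
For anyone who wants to poke at the sibling store/load interaction described above, here is a minimal sketch in C, not the test used in this thread. One thread stores increasing values to a shared word while the other spins loading it. The names `shared_word`, `pair_result` and `run_pair` are made up for illustration, and the code does not pin the threads: which CPU numbers are hyperthread siblings is system-specific, so that part is left as an assumption for you to fill in.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Shared word: one thread stores to it, the other loads from it. */
static _Atomic uint64_t shared_word;

/* Hypothetical result record for the reader thread. */
struct pair_result {
    uint64_t target;     /* final value the writer will store */
    uint64_t last_seen;  /* last value the reader observed */
    int monotonic;       /* 1 if observed values never decreased */
};

/* Writer: stores 1..n to the shared word with relaxed ordering. */
static void *writer(void *arg)
{
    uint64_t n = *(const uint64_t *)arg;
    for (uint64_t i = 1; i <= n; i++)
        atomic_store_explicit(&shared_word, i, memory_order_relaxed);
    return NULL;
}

/* Reader: spins on loads until it observes the final value. */
static void *reader(void *arg)
{
    struct pair_result *res = arg;
    uint64_t prev = 0;
    res->monotonic = 1;
    for (;;) {
        uint64_t cur = atomic_load_explicit(&shared_word,
                                            memory_order_relaxed);
        if (cur < prev)
            res->monotonic = 0;
        prev = cur;
        if (cur >= res->target)
            break;
    }
    res->last_seen = prev;
    return NULL;
}

/* Runs one writer and one reader to completion; returns 0 on success. */
static int run_pair(uint64_t n, struct pair_result *res)
{
    pthread_t w, r;
    assert(n > 0);
    atomic_store(&shared_word, 0);
    res->target = n;
    res->last_seen = 0;
    res->monotonic = 0;
    if (pthread_create(&r, NULL, reader, res) != 0)
        return -1;
    if (pthread_create(&w, NULL, writer, &n) != 0) {
        atomic_store(&shared_word, n); /* release the spinning reader */
        pthread_join(r, NULL);
        return -1;
    }
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}
```

Call `run_pair()` from a small `main` with a large iteration count, build with `cc -O2 -pthread`, and run it under something like `perf stat -e machine_clears.memory_ordering` with the two threads placed first on sibling hyperthreads and then on separate physical cores. That event name is how the counter shows up in `perf list` on many Intel parts; treat it as an assumption and check your own machine. Whether and how much the count differs between the two placements will depend on the microarchitecture.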


>
> > > (2) but to keep the L1 up-to-date, the L1 is updated separately from
> > > (and concurrently with) the store buffer if the line exists in there.
> >
> > Sounds like this would behave like "write no allocate, write through" for L1D. Intel has
> > always advertised writeback L1D (and many have raised an eyebrow at AMD for using write
> > through in Bulldozer...), so I'd be surprised if they were doing something like this.
>
> You're right. Intel does document writeback L1, so my theory
> of "store buffer always drains to L2" must be garbage.
>
> So clearly the store buffer must drain to L1 if it exists.

Indeed, you can test this: if the test is contained in L1, performance is exactly as you'd expect (1 store/cycle) regardless of the locality and pattern of stores.
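
A rough sketch of that kind of L1-contained store test, in C. This is not the benchmark from this thread, just an illustration under stated assumptions: 64-byte lines, a 16 KiB buffer assumed to fit comfortably in L1D, and a hypothetical helper `ns_per_store` that visits the lines in an order set by an odd `stride` (stride 1 is sequential; other odd strides visit every line once per round in a scattered order).

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Assumed sizes: 64-byte lines, 16 KiB working set (256 lines). */
enum { LINE_BYTES = 64, BUF_BYTES = 16 * 1024, LINES = BUF_BYTES / LINE_BYTES };

/* Line-aligned buffer; volatile so the compiler keeps every store. */
static _Alignas(64) volatile uint8_t buf[BUF_BYTES];

/*
 * Stores one byte to each line per round, stepping through the lines by
 * `stride` (mod LINES). LINES is a power of two, so any odd stride
 * touches every line exactly once per round.
 * Returns the average wall-clock nanoseconds per store.
 */
static double ns_per_store(size_t stride, size_t rounds)
{
    struct timespec t0, t1;
    size_t line = 0;

    assert(rounds > 0);
    assert((stride & 1) == 1);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t r = 0; r < rounds; r++) {
        for (size_t i = 0; i < LINES; i++) {
            buf[line * LINE_BYTES] = (uint8_t)(r + 1);
            line = (line + stride) & (LINES - 1);
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / ((double)rounds * LINES);
}
```

Compare, say, `ns_per_store(1, 1000000)` against `ns_per_store(37, 1000000)`. The result is in nanoseconds, so converting to stores/cycle requires knowing the actual core frequency during the run, and the loop overhead and index arithmetic may well dominate unless the compiler unrolls the inner loop. For real measurements, hardware cycle counters are a better fit than `clock_gettime`.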