By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), December 22, 2017 12:16 pm
Room: Moderated Discussions
anon (anon.delete@this.ymous.net) on December 22, 2017 2:29 am wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on December 21, 2017 11:12 am wrote:
> >
> > So my thinking is that the behavior you see might be because
> >
> > (1) the store buffer drains purely to the L2, and the L2 is the real cache coherency
> > boundary for external cores. The store ordering is easy to maintain because the stores
> > really drain in order (although fetching the L2 lines can obviously be entirely OoO).
>
> When you say "real cache coherency boundary", you still agree that lines in L1D still need to be
> invalidated, so either L2 is inclusive of L1D or invalidations will also go to L1D, correct?
Obviously the L1 needs to be invalidated at some point, it just doesn't need to be involved in the cache coherency decisions because lower level caches are inclusive (I think for modern Intel cores it's L3 that is inclusive, not L2, but afaik that has changed over time).
So as far as external cores are concerned, any L1 write ordering is entirely invisible.
But HT can see the L1 write ordering. Unless there is some per-core exclusion bit in the L1 tags, which there might well be.
> > (2) but to keep the L1 up-to-date, the L1 is updated separately from
> > (and concurrently with) the store buffer if the line exists in there.
>
> Sounds like this would behave like "write no allocate, write through" for L1D. Intel has
> always advertised writeback L1D (and many have raised an eyebrow at AMD for using write
> through in Bulldozer...), so I'd be surprised if they were doing something like this.
You're right. Intel does document writeback L1, so my theory of "store buffer always drains to L2" must be garbage.
So clearly the store buffer must drain to L1 if the line is present there.
That doesn't invalidate the other part of the theory: the very act of the store buffer draining to two different cache levels might require synchronizing the store buffer, because of the store-ordering visibility guarantees.
But yeah, it's a pretty weak argument.
So I'm probably wrong. I don't see what else would make the cache level switching matter for timing, though.
Linus