By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), December 22, 2017 12:20 pm
Room: Moderated Discussions
anon.1 (abc.delete@this.def.com) on December 21, 2017 5:09 pm wrote:
>
> Another hypothesis related to store ordering. When you store contiguous locations, the store buffer can
> combine them into one write to L1. So address A, A+4, A+8 can be combined assuming there was no coherence
> snoop on that line (this is part of the advantage of a store buffer). If you interleave it as A, B, A+4,
> B, x86 ordering rules no longer allow the stores to be observed out-of-order. So in the first case you
> likely never exhaust the store buffer whereas in the second case, you do.
I like this theory.
Yes. Doing writes to the same cacheline (or whatever the store buffer entry granularity is - it's likely the same width as the cache access width, rather than the cacheline width) allows you to just merge them in the same store buffer entry with a byte mask. So you're right, the first case has a much bigger effective store buffer.
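To make the "bigger effective store buffer" point concrete, here is a toy model in C. It is only a sketch under stated assumptions, not a description of any real core: `ENTRY_BYTES` and `entries_used` are made-up names, the 64-byte merge granularity is a guess (as noted above, the real width may be smaller), and the rule that a store can merge only into the youngest entry is an assumption standing in for the ordering constraint. The byte mask itself is not modeled, only which block each entry covers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed merge granularity; real hardware may use a narrower width. */
#define ENTRY_BYTES 64u

/*
 * Toy model (hypothetical): a store merges into the youngest store buffer
 * entry if it falls in the same ENTRY_BYTES-sized block, otherwise it
 * allocates a new entry. Merging only into the youngest entry keeps the
 * entries draining in program order. Returns the number of entries used.
 */
size_t entries_used(const uintptr_t *addrs, size_t n)
{
    size_t used = 0;
    uintptr_t youngest = 0;

    assert(addrs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        uintptr_t block = addrs[i] / ENTRY_BYTES;
        if (used == 0 || block != youngest) {
            used++;            /* cannot merge: allocate a new entry */
            youngest = block;
        }
        /* else: same block as the youngest entry, merged for free */
    }
    return used;
}
```

Under this model, four stores to A, A+4, A+8, A+12 in one block consume a single entry, while the same four stores interleaved with stores to a distant B consume eight, because every store lands in a different block from the one before it.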
That should be fairly easy for Travis to check - replace the L1 store with a store to another L2 line, and see if the timing behavior remains.
Linus
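The two store patterns discussed above can be sketched in C as follows. This is a minimal illustration, not the actual benchmark: the function names are hypothetical, timing is left to the caller, and `volatile` is used only so the compiler does not merge or drop the stores itself. To try the suggested check, the caller would point `b` at a location in a different line from the one used before and compare timings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Case 1: contiguous 4-byte stores to A, A+4, A+8, ... */
void store_contiguous(volatile uint32_t *a, size_t n, uint32_t v)
{
    assert(a != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = v;
}

/* Case 2: the same stores to A, each followed by a store to a fixed B. */
void store_interleaved(volatile uint32_t *a, volatile uint32_t *b,
                       size_t n, uint32_t v)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        a[i] = v;   /* A + 4*i */
        *b = v;     /* B, breaks up the run of adjacent stores */
    }
}
```

Both functions leave the same values in `a`; any timing difference between them comes from how the stores are issued, not from what ends up in memory.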