By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), December 20, 2017 5:18 pm
Room: Moderated Discussions
Travis (travis.downs.delete@this.gmail.com) on December 20, 2017 1:44 pm wrote:
>
> That means the extra store to L1 takes a net of 6-15 cycles!
Wild guess: it's about store ordering guarantees, and the L1-hitting store being performed concurrently with the store buffer (or perhaps even bypassing it entirely) for some efficiency reason.
[ Wild hand-waving commences - I have absolutely nothing to back this up with ]
When the first store misses in the L1 (and has to go to the L2), and the second store hits a line that already exists in the L1 cache, you have a nasty situation: you cannot afford to make the second store visible in the cache hierarchy before the first one, because that would violate the x86 memory ordering rules (x86 is TSO: stores have to become globally visible in program order).
So the second store now needs to be delayed until the first store is actually visible in the cache hierarchy. And the way they do that is to drain the store buffer between the two stores.
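
A minimal litmus-test sketch of the ordering rule in question (the names and the L1/L2 residency annotations are my illustrative assumptions; on x86 these release/acquire operations compile to plain mov, so the ordering below is exactly what the hardware has to preserve):

#include <stdatomic.h>
#include <assert.h>

atomic_int data = 0;   /* assume: its cache line misses L1, hits L2 */
atomic_int flag = 0;   /* assume: its cache line is already hot in L1 */

void writer(void)
{
        /* First store: the slow one (L1 miss, goes to L2). */
        atomic_store_explicit(&data, 1, memory_order_release);
        /* Second store: the fast one (L1 hit). It must not become
           globally visible before the data store above. */
        atomic_store_explicit(&flag, 1, memory_order_release);
}

void reader(void)
{
        /* If another core observes flag == 1, store ordering
           guarantees it also observes data == 1. */
        if (atomic_load_explicit(&flag, memory_order_acquire) == 1)
                assert(atomic_load_explicit(&data, memory_order_acquire) == 1);
}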
In contrast, if you have two stores to the same (missed) cache line, the second store just goes to the store buffer and merges into the previous entry, and there is no need to synchronize anything until the store buffer fills up.
So the "just fill up store buffer" case is limited by L2 cache access throughput, but the "mixed L1 and L2 accesses" case has serialization issues.
Linus