By: Travis (travis.downs.delete@this.gmail.com), December 21, 2017 8:26 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on December 21, 2017 1:58 am wrote:
> Etienne (etienne_lorrain.delete@this.yahoo.fr) on December 21, 2017 1:36 am wrote:
> > Travis (travis.downs.delete@this.gmail.com) on December 20, 2017 1:44 pm wrote:
> > > What feature of the L1L2 path on x86 could cause this?
> >
> > Maybe it could be that you are writing one single word into an L2 cacheline, so the rest of that
> > cacheline has to be fetched from L3/main memory and that takes time / cannot be completely hidden?
>
> Sure, but it's the same regardless of the store to L1D-resident line in the middle.
> And it only takes 3.5 clocks (on Skylake. On Wells and bridges it takes 6-6.5 clocks).
>
Also, the whole data set is 64 KiB, so it fits in L2. Even though I'm only writing a word, the whole cache line is either in L2 or not, so it's not as if part of the line comes from L2 and part comes from a higher cache level.
In any case, I've run tests where I overwrite the whole line (e.g., via two 32-byte stores, and various other combinations) and it doesn't change the results.
@Michael - what takes 6-6.5 clocks on Wells/Bridges? The version with a single write to L2 and no intervening write?
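For concreteness, here is a rough C sketch of the kind of store pattern I'm describing. It is not my actual test code: the names (store_pass, REGION_BYTES, l1_word) are made up for illustration, the full-line case uses sixteen 4-byte stores as a stand-in for two 32-byte vector stores, and the timing harness is left out entirely. It assumes 64-byte lines and an L1D smaller than 64 KiB, so that a pass over the region mostly hits L2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    LINE_BYTES     = 64,                  /* assumed cache line size */
    WORDS_PER_LINE = LINE_BYTES / 4,      /* 16 four-byte words per line */
    REGION_BYTES   = 64 * 1024            /* 64 KiB working set */
};

/*
 * Make one pass over the region, storing to the first `words_per_line`
 * words of every cache line (1 = single word, 16 = whole line).
 * If `l1_word` is non-NULL, a store to that fixed location is issued
 * after each line, which models the intervening store to an
 * L1D-resident line. Returns the number of stores made to the region.
 * `volatile` only keeps the compiler from dropping or merging stores.
 */
static size_t store_pass(volatile uint32_t *region, size_t region_bytes,
                         size_t words_per_line, volatile uint32_t *l1_word)
{
    assert(region_bytes % LINE_BYTES == 0);
    assert(words_per_line >= 1 && words_per_line <= WORDS_PER_LINE);

    size_t lines  = region_bytes / LINE_BYTES;
    size_t stores = 0;

    for (size_t line = 0; line < lines; line++) {
        volatile uint32_t *p = region + line * WORDS_PER_LINE;
        for (size_t w = 0; w < words_per_line; w++) {
            p[w] = 1;                     /* store into the (hopefully) L2-resident line */
            stores++;
        }
        if (l1_word != NULL) {
            *l1_word = (uint32_t)(line + 1);  /* intervening store to the fixed line */
        }
    }
    return stores;
}
```

Calling store_pass(region, REGION_BYTES, 1, NULL) gives the single-word variant with no intervening write, and passing 16 plus a pointer into a separate small buffer gives the whole-line variant with the intervening store. To measure anything you would wrap many passes in a cycle counter and divide by the store count.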