By: Jörn Engel (joern.delete@this.purestorage.com), September 23, 2021 9:24 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on September 23, 2021 5:55 am wrote:
> Jörn Engel (joern.delete@this.purestorage.com) on September 23, 2021 5:10 am wrote:
> > Jörn Engel (joern.delete@this.purestorage.com) on September 19, 2021 8:46 pm wrote:
> > >
> > > Not sure. I'll have to play around with the code a bit.
> >
> > Looks like I have to eat my words.
>
> Including your criticism of Brett's suggestion?
Yup. The numbers strongly indicate that the number of L1 cachelines touched is a dominant bottleneck once the rest of the computation is cheap enough. If you need to access L2 or do enough compute to move the bottleneck, my old assertion is probably still true. It will take a while to sort out which parts of my model to abandon and which to keep.
> > Not sure how to explain the crazy numbers, but the CPU behaves as if src and dst were
> > competing for the same cachelines. Modulo the offset the two were exactly 16k apart.
> >
> > If someone has a good explanation, I'd love to hear it.
>
> I don't like your hypothesis. The cache has 8 ways, that's a lot
> for a simple loop that in theory should be happy with 2 ways.
> But I don't have any better theory.
> May be, it's something related to store-to-load predictor mistakenly
> predicting a hit that causes a replay down the road?
I don't like my hypothesis either. Photons "behave as if" they were waves in a medium, but probably aren't. Same for my results: I can describe the behavior by analogy, but have no idea what the true nature is.
> If you publish your source code then people could try on different CPUs. Close relatives
> of Skylake, i.e. Haswell, Broadwell and SKX, would be the most interesting.
It's currently joined at the hip with my vector library. Give me some time to clean things up. Or you can modify your own benchmark if you are impatient.