By: Michael S (already5chosen.delete@this.yahoo.com), September 23, 2021 4:55 am
Room: Moderated Discussions
Jörn Engel (joern.delete@this.purestorage.com) on September 23, 2021 5:10 am wrote:
> Jörn Engel (joern.delete@this.purestorage.com) on September 19, 2021 8:46 pm wrote:
> >
> > Not sure. I'll have to play around with the code a bit.
>
> Looks like I have to eat my words.
Including your criticism of Brett's suggestion?
> As usual, it was easier to write my own benchmark
> than modify yours. The results roughly match. But to mix things up I tried a copy
> using AVX2 (technically just AVX, I think). Here things get interesting.
>
> If I add an offset to both src and dst, there is a 2x performance difference.
> Numbers are cycles for 100 loops, each copying 8kB. Turboboost seems to
> have kicked in, making the numbers look a bit better than they should.
>
> 0: 23418
> 1: 42909
>
> 31: 42798
> 32: 21552
> 33: 43128
>
> 63: 43146
> 64: 21552
> 65: 42657
>
> 95: 42666
> 96: 21507
> 97: 43239
>
> 127: 43233
> 128: 21483
> 129: 42795
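(For anyone who wants to reproduce this: the loop being timed is presumably something along these lines. This is my own sketch, not the actual benchmark code; it assumes 32-byte unaligned AVX loads/stores, a 16 kB src-to-dst distance, and rdtsc for the cycle counts.)

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>   /* __rdtsc */

/* Copy 8 kB with 32-byte AVX loads and stores. */
static void copy8k_avx(char *dst, const char *src)
{
    for (int i = 0; i < 8192; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
        _mm256_storeu_si256((__m256i *)(dst + i), v);
    }
}

int main(void)
{
    /* One page-aligned buffer; dst sits 16 kB past src, and the offset
       under test is added to both pointers. */
    char *buf = aligned_alloc(4096, 1 << 20);
    char *src = buf;
    char *dst = buf + 16 * 1024;

    for (int off = 0; off <= 129; off++) {
        uint64_t t0 = __rdtsc();
        for (int rep = 0; rep < 100; rep++)
            copy8k_avx(dst + off, src + off);
        printf("%d: %llu\n", off, (unsigned long long)(__rdtsc() - t0));
    }
    free(buf);
    return 0;
}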
>
> So what happens if we keep dst aligned and only shift the src?
>
> 0: 21642
> 1: 26088
>
> 31: 25268
> 32: 19208
> 33: 25194
>
> 63: 25338
> 64: 19474
> 65: 25440
>
> 95: 24962
> 96: 19206
> 97: 25120
>
> 127: 25104
> 128: 19214
> 129: 25054
>
> We get a small speedup for the aligned cases. Probably a red herring because I didn't fix
> the frequencies. And we get a large speedup for the unaligned cases. This CPU can do 2 reads
> and 1 write per cycle, so the unaligned reads have mostly been removed as a bottleneck.
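(For scale: 8 kB / 32 B = 256 vector stores per copy, so at one store per cycle the floor is roughly 256 core cycles per copy, or about 25600 for the 100 copies. The aligned readings of ~19-21k coming in below that is consistent with the TSC ticking at the base clock while the core is turboing, as noted above.)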
>
> Ok, now let's keep src aligned and only shift dst.
>
> 0: 24153
> 1: 128886
>
> 31: 127698
> 32: 22206
> 33: 120669
>
> 63: 113868
> 64: 20596
> 65: 79906
>
> 95: 80180
> 96: 20392
> 97: 62200
>
> 127: 63056
> 128: 20636
> 129: 56148
>
> 159: 55502
> 160: 20600
> 161: 46962
>
> 191: 46856
> 192: 20394
> 193: 44454
>
> 223: 44404
> 224: 20284
> 225: 39622
>
> This is crazy. The unaligned cases are 5x slower instead of 2x. But then the
> unaligned performance appears to improve, with a noticeable step each time we do another aligned
> round. Towards the end I see the performance results I would have expected throughout.
>
> If I copy the entire benchmark loop a few times, I get the same results back to back.
> So this is not a warmup problem; the offset between src and dst appears to matter.
> So finally I shifted src by 256 bytes, and now I get reasonable results again.
>
> 0: 20481
> 1: 39663
>
> 31: 38526
> 32: 19767
> 33: 38526
>
> 63: 38523
> 64: 19764
> 65: 38529
>
> 95: 38523
> 96: 19767
> 97: 38520
>
> 127: 38523
> 128: 19668
> 129: 38517
>
> Not sure how to explain the crazy numbers, but the CPU behaves as if src and dst were
> competing for the same cachelines. Modulo the offset the two were exactly 16k apart.
>
> If someone has a good explanation, I'd love to hear it.
I don't like your hypothesis. The cache has 8 ways; that's a lot for a simple loop that in theory should be happy with 2 ways.
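For reference, the arithmetic behind that, assuming the usual Skylake-class L1D geometry of 32 kB, 8 ways, 64 B lines:

#include <stdint.h>

/* 32 kB / 8 ways / 64 B lines = 64 sets, so the set index is address
   bits [11:6] and repeats every 4096 bytes. */
static unsigned l1d_set(uintptr_t addr)
{
    return (unsigned)(addr >> 6) & 63;
}

/* With dst = src + 16384 (a multiple of 4096), each store line does land in
   the same set as the load line it came from, but even with both 8 kB buffers
   fully resident that is only 2 ways per set per buffer, 4 of the 8 in total. */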
But I don't have any better theory.
Maybe it's something related to the store-to-load forwarding predictor mistakenly predicting a hit, which then causes a replay down the road?
If you publish your source code, people could try it on different CPUs. Close relatives of Skylake, i.e. Haswell, Broadwell, and SKX, would be the most interesting.
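One quick way to test the predictor idea, if I have the event names right, would be to run the loop under perf stat -e ld_blocks_partial.address_alias,ld_blocks.store_forward and see whether either counter blows up for the slow offsets: the first counts false partial-address matches against in-flight stores, the second counts loads blocked on stores they could not forward from.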