By: dmcq (dmcq.delete@this.fano.co.uk), September 23, 2021 10:42 am
Room: Moderated Discussions
Jörn Engel (joern.delete@this.purestorage.com) on September 23, 2021 5:10 am wrote:
> Jörn Engel (joern.delete@this.purestorage.com) on September 19, 2021 8:46 pm wrote:
> >
> > Not sure. I'll have to play around with the code a bit.
>
> Looks like I have to eat my words. As usual, it was easier to write my own benchmark
> than modify yours. The results roughly match. But to mix things up I tried a copy
> using AVX2 (technically just AVX, I think). Here things get interesting.
>
> If I add an offset to both src and dst there is a 2x performance difference.
> Numbers are cycles for 100 loops, each copying 8kB. Turboboost seems to
> have kicked in, making the numbers look a bit better than they should.
>
> 0: 23418
> 1: 42909
>
> 31: 42798
> 32: 21552
> 33: 43128
>
> 63: 43146
> 64: 21552
> 65: 42657
>
> 95: 42666
> 96: 21507
> 97: 43239
>
> 127: 43233
> 128: 21483
> 129: 42795
>
> So what happens if we keep dst aligned and only shift the src?
>
> 0: 21642
> 1: 26088
>
> 31: 25268
> 32: 19208
> 33: 25194
>
> 63: 25338
> 64: 19474
> 65: 25440
>
> 95: 24962
> 96: 19206
> 97: 25120
>
> 127: 25104
> 128: 19214
> 129: 25054
>
> We get a small speedup for the aligned cases. Probably a red herring because I didn't fix
> the frequencies. And we get a large speedup for the unaligned cases. This CPU can do 2 reads
> and 1 write per cycle, so the unaligned reads have mostly been removed as a bottleneck.
>
> Ok, now let's keep src aligned and only shift dst.
>
> 0: 24153
> 1: 128886
>
> 31: 127698
> 32: 22206
> 33: 120669
>
> 63: 113868
> 64: 20596
> 65: 79906
>
> 95: 80180
> 96: 20392
> 97: 62200
>
> 127: 63056
> 128: 20636
> 129: 56148
>
> 159: 55502
> 160: 20600
> 161: 46962
>
> 191: 46856
> 192: 20394
> 193: 44454
>
> 223: 44404
> 224: 20284
> 225: 39622
>
> This is crazy. The unaligned cases are now 5x slower than the aligned ones instead of 2x. But the
> unaligned performance then appears to improve, with a noticeable step each time we do another aligned
> round. Towards the end I see the performance results I would have expected throughout.
>
> If I copy the entire benchmark loop a few times, I get the same results back to back.
> So this is not a warmup problem; the offsets between src and dst appear to matter.
> So finally I shifted src by 256 bytes and now I get reasonable results again.
>
> 0: 20481
> 1: 39663
>
> 31: 38526
> 32: 19767
> 33: 38526
>
> 63: 38523
> 64: 19764
> 65: 38529
>
> 95: 38523
> 96: 19767
> 97: 38520
>
> 127: 38523
> 128: 19668
> 129: 38517
>
> Not sure how to explain the crazy numbers, but the CPU behaves as if src and dst were
> competing for the same cachelines. Modulo the offset the two were exactly 16k apart.
>
> If someone has a good explanation, I'd love to hear it.
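The 16k spacing is suggestive. On a typical 32kB 8-way L1 with 64-byte lines (an assumption - Jörn doesn't name the CPU), the set index is just bits [11:6] of the address, so it repeats every 4kB and two buffers exactly 16kB apart land in the same set for every single line. A quick sketch of that arithmetic:

```python
# L1 set index under assumed parameters: 32 KiB, 8-way, 64 B lines
# -> 32768 / (8 * 64) = 64 sets; index = bits [11:6] of the address.
LINE = 64
SETS = 64

def set_index(addr):
    return (addr // LINE) % SETS

src = 0x10000          # hypothetical base address
dst = src + 16 * 1024  # 16 KiB apart, as in the measurements

# Count how many lines of an 8 KiB copy map src and dst to the same set.
collisions = sum(
    set_index(src + off) == set_index(dst + off)
    for off in range(0, 8192, LINE)
)
print(collisions, 8192 // LINE)  # prints "128 128": every line collides
```

With both streams hammering the same 64 sets line for line, any extra conflict pressure from the split stores would plausibly bite much harder than with well-separated buffers.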
I think L1 caches would be better off using low-discrepancy quasirandom sequences for their indexing, or just taking the address modulo some random-ish number, to avoid falling prey to that sort of effect so easily - even if that meant not using every single line of some power of two! They'd need to use virtual addresses to make that work well, though.
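As a toy version of that idea, here is a hashed index function (my own sketch, not any real CPU's scheme) that XORs the bits above the index back into it. A power-of-two stride that pins plain modulo indexing to a single set gets spread across many sets:

```python
LINE = 64
SETS = 64

def plain_index(addr):
    # Conventional indexing: low-order line bits select the set.
    return (addr // LINE) % SETS

def hashed_index(addr):
    # Toy hash: fold the bits above the index into the index itself.
    line = addr // LINE
    return (line ^ (line // SETS)) % SETS

# Lines spaced 16 KiB apart - the stride from the measurements above.
addrs = [i * 16 * 1024 for i in range(16)]
print(len({plain_index(a) for a in addrs}))   # prints 1: all alias to one set
print(len({hashed_index(a) for a in addrs}))  # prints 16: spread over 16 sets
```

Real designs would want a cheaper or better-mixing hash, and as noted it only works cleanly on virtual addresses, but it shows why breaking the pure power-of-two mapping kills this class of pathology.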