By: Jörn Engel (joern.delete@this.purestorage.com), September 23, 2021 5:10 am
Room: Moderated Discussions
Jörn Engel (joern.delete@this.purestorage.com) on September 19, 2021 8:46 pm wrote:
>
> Not sure. I'll have to play around with the code a bit.
Looks like I have to eat my words. As usual, it was easier to write my own benchmark than modify yours. The results roughly match. But to mix things up I tried a copy using AVX2 (technically just AVX, I think). Here things get interesting.
If I add the same offset to both src and dst, there is a 2x performance difference between offsets that are a multiple of 32 and all others. Each line below is offset: cycles for 100 loops, each copying 8kB. Turbo Boost seems to have kicked in, making the numbers look a bit better than they should.
0: 23418
1: 42909
31: 42798
32: 21552
33: 43128
63: 43146
64: 21552
65: 42657
95: 42666
96: 21507
97: 43239
127: 43233
128: 21483
129: 42795
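One way to see what an odd offset changes for the copy loop is to count how many vector accesses straddle a cache line. The sketch below is not the benchmark itself. It assumes 64-byte cache lines, one 32-byte access per iteration, and a line-aligned base address, and the helper name `count_line_crossings` is made up for illustration.

```c
#include <stddef.h>

/* Assumptions, not taken from the post: 64-byte cache lines and
 * one 32-byte vector access per loop iteration. */
enum { VEC_BYTES = 32, LINE_BYTES = 64 };

/* Count the 32-byte accesses that straddle a cache-line boundary when
 * walking `len` bytes starting `offset` bytes past a line-aligned base. */
static size_t count_line_crossings(size_t offset, size_t len)
{
    size_t crossings = 0;
    for (size_t pos = 0; pos + VEC_BYTES <= len; pos += VEC_BYTES) {
        size_t in_line = (offset + pos) % LINE_BYTES;
        if (in_line + VEC_BYTES > LINE_BYTES)
            crossings++;
    }
    return crossings;
}
```

Under these assumptions, `count_line_crossings(32, 8192)` is 0 while `count_line_crossings(1, 8192)` is 128: for any offset that is not a multiple of 32, every second access is split across two lines. That is at least consistent with multiples of 32 being the fast cases, though it does not by itself prove where the 2x comes from.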
So what happens if we keep dst aligned and only shift the src?
0: 21642
1: 26088
31: 25268
32: 19208
33: 25194
63: 25338
64: 19474
65: 25440
95: 24962
96: 19206
97: 25120
127: 25104
128: 19214
129: 25054
We get a small speedup for the aligned cases. Probably a red herring because I didn't fix the frequencies. And we get a large speedup for the unaligned cases. This CPU can do 2 reads and 1 write per cycle, so the unaligned reads have mostly been removed as a bottleneck.
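A quick sanity check on these numbers: with one 32-byte write per cycle, a copy should not exceed 32 bytes per core cycle. The sketch below converts a measured count into bytes per cycle. It assumes 8kB means 8192 bytes and that each write is a 32-byte store, and `bytes_per_cycle` is a hypothetical helper, not part of the benchmark.

```c
/* Sizes from the post: 100 loops, each copying 8kB.
 * Assumption: 8kB means 8192 bytes. */
enum { LOOPS = 100, COPY_BYTES = 8192 };

/* Average number of bytes copied per counted cycle. */
static double bytes_per_cycle(unsigned long cycles)
{
    return (double)LOOPS * COPY_BYTES / (double)cycles;
}
```

For 21552 cycles this gives about 38 bytes per cycle, and for 19208 cycles about 42.6, both above the 32-byte ceiling. That fits the earlier remark that turbo makes the numbers look better than they should, assuming the counter ticks at a fixed reference frequency rather than the actual core clock (the post does not say how cycles were measured).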
Ok, now let's keep src aligned and only shift dst.
0: 24153
1: 128886
31: 127698
32: 22206
33: 120669
63: 113868
64: 20596
65: 79906
95: 80180
96: 20392
97: 62200
127: 63056
128: 20636
129: 56148
159: 55502
160: 20600
161: 46962
191: 46856
192: 20394
193: 44454
223: 44404
224: 20284
225: 39622
This is crazy. The unaligned cases are 5x slower than the aligned ones instead of 2x. But then the unaligned performance improves, with a noticeable step each time the offset passes another multiple of 32. Towards the end I see the results I would have expected throughout.
If I copy the entire benchmark loop a few times, I get the same results back to back. So this is not a warmup problem; the relative offset between src and dst appears to matter. So finally I shifted src by 256 bytes and now I get reasonable results again.
0: 20481
1: 39663
31: 38526
32: 19767
33: 38526
63: 38523
64: 19764
65: 38529
95: 38523
96: 19767
97: 38520
127: 38523
128: 19668
129: 38517
Not sure how to explain the crazy numbers, but the CPU behaves as if src and dst were competing for the same cachelines. Modulo the offset, the two buffers were exactly 16kB apart.
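The address relationship can be made concrete with a little arithmetic. The sketch below reduces the src-to-dst distance modulo a power-of-two granule. The 4096-byte granule, the base address, and the assumption that dst sits above src are all illustrative choices rather than details of the benchmark, and the sketch does not claim to identify the cause.

```c
#include <stdint.h>

/* Distance from src to dst, reduced modulo a power-of-two granule.
 * A result of 0 means both addresses share the same low bits. */
static uintptr_t distance_mod(uintptr_t src, uintptr_t dst, uintptr_t granule)
{
    return (dst - src) & (granule - 1);
}
```

With dst 16384 bytes above src plus an offset of 1, `distance_mod(src, dst, 4096)` is 1, so the two streams sit at nearly identical low address bits. Moving src by 256 bytes (upward, in this sketch) changes the result to 3841.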
If someone has a good explanation, I'd love to hear it.