By: dmcq (dmcq.delete@this.fano.co.uk), September 23, 2021 10:42 am

Room: Moderated Discussions

Jörn Engel (joern.delete@this.purestorage.com) on September 23, 2021 5:10 am wrote:

> Jörn Engel (joern.delete@this.purestorage.com) on September 19, 2021 8:46 pm wrote:

> >

> > Not sure. I'll have to play around with the code a bit.

>

> Looks like I have to eat my words. As usual, it was easier to write my own benchmark

> than modify yours. The results roughly match. But to mix things up I tried a copy

> using AVX2 (technically just AVX, I think). Here things get interesting.

>

> If I add an offset to both src and dst there is a 2x performance difference.

> Numbers are cycles for 100 loops, each copying 8kB. Turboboost seems to

> have kicked in, making the numbers look a bit better than they should.

>

> 0: 23418

> 1: 42909

>

> 31: 42798

> 32: 21552

> 33: 43128

>

> 63: 43146

> 64: 21552

> 65: 42657

>

> 95: 42666

> 96: 21507

> 97: 43239

>

> 127: 43233

> 128: 21483

> 129: 42795

>

> So what happens if we keep dst aligned and only shift the src?

>

> 0: 21642

> 1: 26088

>

> 31: 25268

> 32: 19208

> 33: 25194

>

> 63: 25338

> 64: 19474

> 65: 25440

>

> 95: 24962

> 96: 19206

> 97: 25120

>

> 127: 25104

> 128: 19214

> 129: 25054

>

> We get a small speedup for the aligned cases. Probably a red herring because I didn't fix

> the frequencies. And we get a large speedup for the unaligned cases. This CPU can do 2 reads

> and 1 write per cycle, so the unaligned reads have mostly been removed as a bottleneck.

>

> Ok, now let's keep src aligned and only shift dst.

>

> 0: 24153

> 1: 128886

>

> 31: 127698

> 32: 22206

> 33: 120669

>

> 63: 113868

> 64: 20596

> 65: 79906

>

> 95: 80180

> 96: 20392

> 97: 62200

>

> 127: 63056

> 128: 20636

> 129: 56148

>

> 159: 55502

> 160: 20600

> 161: 46962

>

> 191: 46856

> 192: 20394

> 193: 44454

>

> 223: 44404

> 224: 20284

> 225: 39622

>

> This is crazy. The unaligned cases now cost 5x as much as the aligned ones instead of 2x. But then the

> unaligned performance appears to improve, with a noticeable step each time we pass another aligned

> offset. Towards the end I see the performance results I would have expected throughout.

>

> If I copy the entire benchmark loop a few times, I get the same results back to back.

> So this is not a warm-up problem; the offsets between src and dst appear to matter.

> So finally I shifted src by 256 bytes, and now I get reasonable results again.

>

> 0: 20481

> 1: 39663

>

> 31: 38526

> 32: 19767

> 33: 38526

>

> 63: 38523

> 64: 19764

> 65: 38529

>

> 95: 38523

> 96: 19767

> 97: 38520

>

> 127: 38523

> 128: 19668

> 129: 38517

>

> Not sure how to explain the crazy numbers, but the CPU behaves as if src and dst were

> competing for the same cachelines. Modulo the offset the two were exactly 16k apart.

>

> If someone has a good explanation, I'd love to hear it.
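
The cacheline-competition suspicion can be sanity-checked with a toy model. This is just a sketch, assuming a typical 32 KB, 8-way L1d with 64-byte lines (so 64 sets, selected by address bits 6-11) - the post doesn't say which CPU this is:

```python
LINE = 64           # cache line size in bytes
SETS = 64           # 32 KiB / 8 ways / 64 B per line

def cache_set(addr):
    # Power-of-two indexing: the set is selected by address bits 6..11.
    return (addr // LINE) % SETS

src = 0x7f0000000000          # arbitrary 64-byte-aligned base
for delta in (16384, 16384 + 256):
    dst = src + delta
    collisions = sum(cache_set(src + i) == cache_set(dst + i)
                     for i in range(0, 8192, LINE))
    print(f"dst - src = {delta}: {collisions}/128 line pairs share a set")
```

With src and dst exactly 16k apart, every pair of corresponding lines lands in the same set (16384 is a multiple of 64 lines x 64 bytes); shifting src by 256 bytes moves each dst line 4 sets away from its src counterpart, which matches the "reasonable results" in the last table.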

I think L1 caches would be better if they used low-discrepancy quasirandom sequences for indexing, or just took the address modulo some random number, to avoid falling prey to that sort of effect so easily - even if that meant they couldn't use every single line of a power-of-two-sized cache! They would need to index by virtual addresses, though, to get that working well.
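
As a toy illustration of the modulo idea (61 is just an arbitrary prime near a typical 64-set L1d, not a hardware proposal):

```python
LINE = 64

def set_pow2(addr, sets=64):
    # Conventional indexing: line number modulo a power of two.
    return (addr // LINE) % sets

def set_prime(addr, sets=61):
    # Toy alternative: line number modulo a prime near the set count.
    return (addr // LINE) % sets

src = 0
dst = src + 16384             # the troublesome stride from the post
pow2_hits = sum(set_pow2(src + i) == set_pow2(dst + i)
                for i in range(0, 8192, LINE))
prime_hits = sum(set_prime(src + i) == set_prime(dst + i)
                 for i in range(0, 8192, LINE))
print(f"power-of-two sets: {pow2_hits}/128 colliding line pairs")
print(f"mod-61 sets:       {prime_hits}/128 colliding line pairs")
```

The prime modulus spreads the 16k stride across sets, since 256 lines is not a multiple of 61. The obvious trade-offs are the ones above: a division (or equivalent trick) in the indexing path, and three of 64 sets' worth of capacity going unused.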
