By: --- (---.delete@this.redheron.com), September 23, 2021 11:53 am

Room: Moderated Discussions

dmcq (dmcq.delete@this.fano.co.uk) on September 23, 2021 10:42 am wrote:

> Jörn Engel (joern.delete@this.purestorage.com) on September 23, 2021 5:10 am wrote:

> > Jörn Engel (joern.delete@this.purestorage.com) on September 19, 2021 8:46 pm wrote:

> > >

> > > Not sure. I'll have to play around with the code a bit.

> >

> > Looks like I have to eat my words. As usual, it was easier to write my own benchmark

> > than modify yours. The results roughly match. But to mix things up I tried a copy

> > using AVX2 (technically just AVX, I think). Here things get interesting.

> >

> > If I add an offset to both src and dst there is a 2x performance difference.

> > Numbers are cycles for 100 loops, each copying 8kB. Turboboost seems to

> > have kicked in, making the numbers look a bit better than they should.

> >

> > 0: 23418

> > 1: 42909

> >

> > 31: 42798

> > 32: 21552

> > 33: 43128

> >

> > 63: 43146

> > 64: 21552

> > 65: 42657

> >

> > 95: 42666

> > 96: 21507

> > 97: 43239

> >

> > 127: 43233

> > 128: 21483

> > 129: 42795

> >

> > So what happens if we keep dst aligned and only shift the src?

> >

> > 0: 21642

> > 1: 26088

> >

> > 31: 25268

> > 32: 19208

> > 33: 25194

> >

> > 63: 25338

> > 64: 19474

> > 65: 25440

> >

> > 95: 24962

> > 96: 19206

> > 97: 25120

> >

> > 127: 25104

> > 128: 19214

> > 129: 25054

> >

> > We get a small speedup for the aligned cases. Probably a red herring because I didn't fix

> > the frequencies. And we get a large speedup for the unaligned cases. This CPU can do 2 reads

> > and 1 write per cycle, so the unaligned reads have mostly been removed as a bottleneck.

> >

> > Ok, now let's keep src aligned and only shift dst.

> >

> > 0: 24153

> > 1: 128886

> >

> > 31: 127698

> > 32: 22206

> > 33: 120669

> >

> > 63: 113868

> > 64: 20596

> > 65: 79906

> >

> > 95: 80180

> > 96: 20392

> > 97: 62200

> >

> > 127: 63056

> > 128: 20636

> > 129: 56148

> >

> > 159: 55502

> > 160: 20600

> > 161: 46962

> >

> > 191: 46856

> > 192: 20394

> > 193: 44454

> >

> > 223: 44404

> > 224: 20284

> > 225: 39622

> >

> > This is crazy. Performance for the unaligned cases is 5x higher instead of 2x. But then the

> > unaligned performance appears to improve, with a noticeable step each time we do another aligned

> > round. Towards the end I see the performance results I would have expected throughout.

> >

> > If I copy the entire benchmark loop a few times, I get the same results back to back.

> > So this is not a warmup-problem, the offsets between src and dst appear to matter.

> > So finally I shifted src by 256 bytes and now I get reasonable results again.

> >

> > 0: 20481

> > 1: 39663

> >

> > 31: 38526

> > 32: 19767

> > 33: 38526

> >

> > 63: 38523

> > 64: 19764

> > 65: 38529

> >

> > 95: 38523

> > 96: 19767

> > 97: 38520

> >

> > 127: 38523

> > 128: 19668

> > 129: 38517

> >

> > Not sure how to explain the crazy numbers, but the CPU behaves as if src and dst were

> > competing for the same cachelines. Modulo the offset the two were exactly 16k apart.

> >

> > If someone has a good explanation, I'd love to hear it.

>

> I think L1 caches would be better if they used low-discrepancy quasirandom number sequences instead, or just

> modulo some random number, to avoid falling prey to that sort of effect too easily - even if they didn't use

> every single line of some power of two! Need to use virtual addresses to get that working well, though.
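
To make the aliasing Jörn saw concrete: assuming a fairly generic 32KB, 8-way, 64B-line L1D (64 sets) - that geometry is my assumption, not something stated about his machine - the set index is just address bits [11:6], so any two addresses a multiple of 4KB apart land in the same set, and 16KB is such a multiple. A quick sketch:

#include <stdio.h>
#include <stdint.h>

/* Assumed L1D geometry: 32KB, 8-way, 64-byte lines -> 64 sets. */
#define LINE_BYTES 64
#define NUM_SETS   64

static unsigned set_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);  /* address bits [11:6] */
}

int main(void)
{
    uintptr_t src = 0x100000;          /* hypothetical buffer addresses */
    uintptr_t dst = src + 16 * 1024;   /* "exactly 16k apart", as in the benchmark */

    /* 16KB is a multiple of NUM_SETS * LINE_BYTES = 4KB, so each src line and
       its matching dst line fall into the same set; the copy loop is
       effectively fighting over one set's 8 ways at a time. */
    for (unsigned off = 0; off < 256; off += LINE_BYTES)
        printf("off %3u: src set %2u, dst set %2u\n",
               off, set_index(src + off), set_index(dst + off));
    return 0;
}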

For anything closer to the core than L2 (i.e. the L1), it's really hard to construct an index hash that is fast enough to be practical.
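
To give a feel for what "fast enough" is up against: the usual cheap trick is to XOR-fold a few higher address bits into the set index, which is fine for an L2 or L3, but even those couple of extra XOR levels sit right on the L1 load-to-use critical path. A sketch (bit positions purely illustrative, not any real design):

#include <stdint.h>

/* Sketch of XOR-folded set indexing of the kind some larger caches use to
   scramble power-of-two strides across sets. Cheap, but still a couple of
   XOR levels deeper than simply extracting bits [11:6]. */
static inline unsigned hashed_set_index(uint64_t addr)
{
    uint64_t line = addr >> 6;                                    /* 64-byte lines */
    return (unsigned)((line ^ (line >> 6) ^ (line >> 12)) & 63);  /* 64 sets */
}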

There are alternatives to this idea that are probably more practical, generally along the lines of a secondary, very small, close-to-fully-associative cache that can hold the small excess created by, e.g., excessive activity within one set.

This same philosophy is of general utility; e.g. rather than having an expensive, large, fully associative store queue, make it two-way set associative with a small (eight entries or so) fully associative annex. Of course there is a complexity cost: you're now looking in (and keeping synchronized) one more storage locker; but the energy savings (and even performance boosts, because the size a fully associative structure can reach is limited by cycle time) can be substantial.
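
In software terms the lookup side of that arrangement (for either the cache annex above or the store queue) is roughly the following; all sizes and field names are illustrative, not any particular core:

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* The bulk of the store queue is 2-way set-associative, indexed by low
   address bits; a tiny fully associative annex absorbs the occasional extra
   store that collides within one set. Same main-structure-plus-annex
   pattern as the L1 idea above. */
#define SQ_SETS   16
#define SQ_WAYS   2
#define ANNEX_LEN 8

struct sq_entry { bool valid; uint64_t addr; uint64_t data; };

struct store_queue {
    struct sq_entry main[SQ_SETS][SQ_WAYS];
    struct sq_entry annex[ANNEX_LEN];
};

/* A load probes the indexed set first, then the annex: two tag compares plus
   eight, instead of comparing against every entry of a large fully
   associative queue. */
static struct sq_entry *sq_lookup(struct store_queue *sq, uint64_t addr)
{
    unsigned set = (unsigned)((addr >> 3) % SQ_SETS);   /* 8-byte granules, illustrative */
    for (int w = 0; w < SQ_WAYS; w++)
        if (sq->main[set][w].valid && sq->main[set][w].addr == addr)
            return &sq->main[set][w];
    for (int i = 0; i < ANNEX_LEN; i++)
        if (sq->annex[i].valid && sq->annex[i].addr == addr)
            return &sq->annex[i];
    return NULL;                                        /* no match */
}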
