By: --- (---.delete@this.redheron.com), September 23, 2021 11:53 am
Room: Moderated Discussions
dmcq (dmcq.delete@this.fano.co.uk) on September 23, 2021 10:42 am wrote:
> Jörn Engel (joern.delete@this.purestorage.com) on September 23, 2021 5:10 am wrote:
> > Jörn Engel (joern.delete@this.purestorage.com) on September 19, 2021 8:46 pm wrote:
> > >
> > > Not sure. I'll have to play around with the code a bit.
> >
> > Looks like I have to eat my words. As usual, it was easier to write my own benchmark
> > than modify yours. The results roughly match. But to mix things up I tried a copy
> > using AVX2 (technically just AVX, I think). Here things get interesting.
> >
> > If I add an offset to both src and dst there is a 2x performance difference.
> > Numbers are cycles for 100 loops, each copying 8kB. Turboboost seems to
> > have kicked in, making the numbers look a bit better than they should.
> >
> > 0: 23418
> > 1: 42909
> >
> > 31: 42798
> > 32: 21552
> > 33: 43128
> >
> > 63: 43146
> > 64: 21552
> > 65: 42657
> >
> > 95: 42666
> > 96: 21507
> > 97: 43239
> >
> > 127: 43233
> > 128: 21483
> > 129: 42795
> >
> > So what happens if we keep dst aligned and only shift the src?
> >
> > 0: 21642
> > 1: 26088
> >
> > 31: 25268
> > 32: 19208
> > 33: 25194
> >
> > 63: 25338
> > 64: 19474
> > 65: 25440
> >
> > 95: 24962
> > 96: 19206
> > 97: 25120
> >
> > 127: 25104
> > 128: 19214
> > 129: 25054
> >
> > We get a small speedup for the aligned cases. Probably a red herring because I didn't fix
> > the frequencies. And we get a large speedup for the unaligned cases. This CPU can do 2 reads
> > and 1 write per cycle, so the unaligned reads have mostly been removed as a bottleneck.
> >
> > Ok, now let's keep src aligned and only shift dst.
> >
> > 0: 24153
> > 1: 128886
> >
> > 31: 127698
> > 32: 22206
> > 33: 120669
> >
> > 63: 113868
> > 64: 20596
> > 65: 79906
> >
> > 95: 80180
> > 96: 20392
> > 97: 62200
> >
> > 127: 63056
> > 128: 20636
> > 129: 56148
> >
> > 159: 55502
> > 160: 20600
> > 161: 46962
> >
> > 191: 46856
> > 192: 20394
> > 193: 44454
> >
> > 223: 44404
> > 224: 20284
> > 225: 39622
> >
> > This is crazy. Performance for the unaligned cases is 5x higher instead of 2x. But then the
> > unaligned performance appears to improve, with a noticeable step each time we do another aligned
> > round. Towards the end I see the performance results I would have expected throughout.
> >
> > If I copy the entire benchmark loop a few times, I get the same results back to back.
> > So this is not a warmup-problem, the offsets between src and dst appear to matter.
> > So finally I shifted src by 256 bytes and now I get reasonable results again.
> >
> > 0: 20481
> > 1: 39663
> >
> > 31: 38526
> > 32: 19767
> > 33: 38526
> >
> > 63: 38523
> > 64: 19764
> > 65: 38529
> >
> > 95: 38523
> > 96: 19767
> > 97: 38520
> >
> > 127: 38523
> > 128: 19668
> > 129: 38517
> >
> > Not sure how to explain the crazy numbers, but the CPU behaves as if src and dst were
> > competing for the same cachelines. Modulo the offset the two were exactly 16k apart.
> >
> > If someone has a good explanation, I'd love to hear it.
>
> I think L1 caches would be better if they used low-discrepancy quasirandom number sequences instead, or just
> modulo some random number, to avoid falling prey to that sort of effect too easily - even if they didn't use
> every single line of some power of two! Need to use virtual addresses to get that working well though.
For anything below L2, it's really hard to construct an index hash that is fast enough to be practical.
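For concreteness, here is the kind of plain power-of-two indexing that such a hash would replace. This is a minimal sketch that assumes a 32 KiB, 8-way L1 with 64-byte lines (64 sets) and uses hypothetical base addresses; none of those numbers come from the benchmark above. With that geometry, two addresses 16 KiB apart always select the same set, which is the sort of competition Jörn describes.

/* Minimal sketch: plain power-of-two L1 set indexing.
 * Geometry (32 KiB, 8-way, 64-byte lines -> 64 sets) and the base
 * addresses below are illustrative assumptions, not taken from the post. */
#include <stdio.h>
#include <stdint.h>

#define LINE_BYTES   64u
#define NUM_SETS     64u   /* 32 KiB / (8 ways * 64 B) */

static unsigned set_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);  /* just low address bits */
}

int main(void)
{
    uintptr_t src = 0x100000;            /* hypothetical base address */
    uintptr_t dst = src + 16 * 1024;     /* 16 KiB apart, as in the post */

    for (unsigned off = 0; off < 8192; off += 2048)
        printf("off %4u: src set %2u, dst set %2u\n",
               off, set_index(src + off), set_index(dst + off));
    /* With power-of-two indexing the two columns are always equal:
     * every src line and its dst counterpart want the same set. */
    return 0;
}

A randomizing index hash would break that correlation, but it sits on the load critical path, which is exactly the speed objection.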
There are alternatives to this idea that are probably more practical, generally along the lines of a secondary cache, very small and close to fully associative, that can hold the small excess created by, e.g., excessive activity within one set.
This same philosophy is of general utility. For example, rather than building an expensive, large, fully associative store queue, make it two-way set associative with a small (8-entry or so) fully associative annex. There is a complexity cost, of course: you are now looking up (and keeping synchronized) one more storage structure. But the energy savings can be substantial, and so can the performance gains, because the size a fully associative structure can reach is limited by the cycle time.
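A toy model of that arrangement, with purely illustrative sizes (16 sets x 2 ways backed by an 8-entry fully associative annex) and with eviction and the synchronization bookkeeping omitted; the names and layout are assumptions for the sketch, not any real design:

/* Toy sketch of a 2-way set-associative table plus a small fully
 * associative annex that absorbs overflow from hot sets.  Sizes and
 * tag layout are illustrative assumptions only. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define SETS   16
#define WAYS   2
#define ANNEX  8

struct entry { bool valid; uint64_t tag; };

static struct entry main_tbl[SETS][WAYS];
static struct entry annex[ANNEX];

static bool lookup(uint64_t key)
{
    unsigned set = (unsigned)(key % SETS);
    for (int w = 0; w < WAYS; w++)                 /* probe the set... */
        if (main_tbl[set][w].valid && main_tbl[set][w].tag == key)
            return true;
    for (int i = 0; i < ANNEX; i++)                /* ...and the annex */
        if (annex[i].valid && annex[i].tag == key)
            return true;
    return false;
}

static void insert(uint64_t key)
{
    unsigned set = (unsigned)(key % SETS);
    for (int w = 0; w < WAYS; w++)                 /* free way in the set? */
        if (!main_tbl[set][w].valid) {
            main_tbl[set][w] = (struct entry){ true, key };
            return;
        }
    for (int i = 0; i < ANNEX; i++)                /* set is full: spill */
        if (!annex[i].valid) {
            annex[i] = (struct entry){ true, key };
            return;
        }
    /* Both full: a real design would evict; omitted here. */
}

int main(void)
{
    /* Five keys that all map to set 0 of the 2-way table: the first two
     * land in the set, the other three are caught by the annex. */
    for (uint64_t k = 0; k < 5 * SETS; k += SETS)
        insert(k);
    for (uint64_t k = 0; k < 5 * SETS; k += SETS)
        printf("key %2llu: %s\n", (unsigned long long)k,
               lookup(k) ? "hit" : "miss");
    return 0;
}

In hardware the set and the annex would be probed in parallel, which is exactly the extra lookup (and the extra structure to keep coherent) that the complexity cost above refers to.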