By: dmcq (dmcq.delete@this.fano.co.uk), September 20, 2021 2:19 am
Room: Moderated Discussions
Jörn Engel (joern.delete@this.purestorage.com) on September 19, 2021 8:46 pm wrote:
> Anon (no.delete@this.spam.com) on September 19, 2021 5:32 pm wrote:
> > Michael S (already5chosen.delete@this.yahoo.com) on September 19, 2021 4:46 pm wrote:
> > > > It's night here now.
> > >
> > > So, I measured the time, in microseconds, of summing 8,000,000 L1D-resident 64-bit numbers
> > > (16,000 B buffer, summation repeated 4,000 times) at different alignments and using different
> > > access/arithmetic widths. CPU: Skylake Client (Xeon E-2176G) downclocked to 4.25 GHz.
> > >
> > > Here are results:
> > > 8-byte (64b) accesses:
> > > 0 1064
> > > 1 1170
> > >
> > > 16-byte (128b) accesses:
> > > 0 483
> > > 1 701
> > >
> > > 32-byte (256b) accesses:
> > > 0 256
> > > 1 468
> > >
> > > Misalignment penalty (of streaming add):
> > > 8-byte - 1.10x
> > > 16-byte - 1.45x
> > > 32-byte - 1.83x
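Michael didn't post his source, but for anyone who wants to reproduce this, a minimal sketch of the kind of loop being described might look like the following. The AVX2 intrinsics, the four accumulators and the timing harness are my guesses, not his actual benchmark; this is the 32-byte (256b) variant, with the first argument controlling the byte offset. Multiple accumulators are needed to break the add dependency chain, since his aligned numbers work out to more than one add per cycle.

/* Sketch (my reconstruction, not Michael's actual code): sum a 16,000 B
 * L1D-resident buffer 4,000 times with 32-byte AVX2 loads at a given
 * byte offset. Four accumulators break the add dependency chain so the
 * loop runs at the load limit rather than at add latency. */
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <immintrin.h>

#define BUFSZ 16000                 /* 2,000 64-bit elements            */
#define REPS  4000                  /* => 8,000,000 elements summed     */

static char buf[BUFSZ + 64] __attribute__((aligned(64)));

int main(int argc, char **argv)
{
    long off = argc > 1 ? atol(argv[1]) : 0;  /* 0 = aligned, 1 = not   */
    const char *base = buf + off;
    __m256i a0 = _mm256_setzero_si256(), a1 = a0, a2 = a0, a3 = a0;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; r++)
        for (long i = 0; i < BUFSZ; i += 128) {
            a0 = _mm256_add_epi64(a0, _mm256_loadu_si256((const __m256i *)(base + i)));
            a1 = _mm256_add_epi64(a1, _mm256_loadu_si256((const __m256i *)(base + i + 32)));
            a2 = _mm256_add_epi64(a2, _mm256_loadu_si256((const __m256i *)(base + i + 64)));
            a3 = _mm256_add_epi64(a3, _mm256_loadu_si256((const __m256i *)(base + i + 96)));
        }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* fold the accumulators and print, so the loop isn't optimized away */
    uint64_t s[4];
    a0 = _mm256_add_epi64(_mm256_add_epi64(a0, a1), _mm256_add_epi64(a2, a3));
    _mm256_storeu_si256((__m256i *)s, a0);
    printf("offset %ld: sum=%llu, %ld us\n", off,
           (unsigned long long)(s[0] + s[1] + s[2] + s[3]),
           (long)((t1.tv_sec - t0.tv_sec) * 1000000 +
                  (t1.tv_nsec - t0.tv_nsec) / 1000));
    return 0;
}

(Compile with gcc -O2 -mavx2. The 16,000 B buffer comfortably fits the 32 KB L1D, so the loads stay L1D-resident as in Michael's setup.)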
>
> Thank you!
>
> > I think you should point out when the access cross a cache line or not.
>
> Interesting idea. With 8-byte accesses, roughly 12.5% of accesses should cross a cacheline. That ratio goes
> up with access size, as does the misalignment penalty. The numbers don't quite match up, but a lot of
> the measurements could be explained if performance was limited by the number of cachelines read.
>
> 1064 - 1.77 cachelines / cycle
> 1170 - 1.81 cachelines / cycle
> 483 - 1.95 cachelines / cycle
> 701 - 1.67 cachelines / cycle
> 256 - 1.83 cachelines / cycle
> 468 - 1.51 cachelines / cycle
>
> Not sure. I'll have to play around with the code a bit.
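Those cachelines/cycle numbers check out exactly, by the way, if I read them as (accesses + line-crossing accesses) per cycle: with a w-byte access at byte offset 1 and stride w, a fraction w/64 of the accesses straddle two 64-byte lines (12.5% / 25% / 50% for w = 8 / 16 / 32). A quick sanity check of the arithmetic, using Michael's 4.25 GHz clock (my interpretation of the metric, not necessarily Jörn's):

/* Rederive the cachelines/cycle figures above, assuming the metric is
 * (accesses + line-crossing accesses) per cycle at 4.25 GHz. */
#include <stdio.h>

int main(void)
{
    const double ghz = 4.25;
    const struct { int width, offset; double us; } m[] = {
        { 8, 0, 1064 }, { 8, 1, 1170 },
        {16, 0,  483 }, {16, 1,  701 },
        {32, 0,  256 }, {32, 1,  468 },
    };

    for (int i = 0; i < 6; i++) {
        double accesses = 64e6 / m[i].width;       /* 64 MB summed in total */
        double crossing = m[i].offset ? m[i].width / 64.0 : 0.0;
        double touches  = accesses * (1.0 + crossing);
        double cycles   = m[i].us * ghz * 1e3;     /* us -> cycles          */
        printf("%2dB, offset %d: %.2f cacheline touches/cycle\n",
               m[i].width, m[i].offset, touches / cycles);
    }
    return 0;
}

That reproduces 1.77 / 1.81 / 1.95 / 1.67 / 1.83 / 1.51 to two decimal places, so the measurements really do look load-port-limited rather than limited by a fixed per-line cost.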
It'd be interesting to see how Supercomputer Fugaku fares that way; they said they did special work to access two cache lines at once, to cope with their 512-bit reads and writes being split in two like that.