By: Jörn Engel (joern.delete@this.purestorage.com), September 19, 2021 8:46 pm
Room: Moderated Discussions
Anon (no.delete@this.spam.com) on September 19, 2021 5:32 pm wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on September 19, 2021 4:46 pm wrote:
> > > It's a night here now.
> >
> > So, I measured time, in microseconds, of summation of 8,000,000 L1D-resident 64-bit numbers
> > (16,000 B buffer, summation repeated 4,000 times) at different alignments and using different
> > access/arithmetic width. CPU - Skylake Client (Xeon E-2176G) downclocked to 4.25 GHz.
> >
> > Here are results:
> > 8-byte (64b) accesses:
> > 0 1064
> > 1 1170
> >
> > 16-byte (128b) accesses:
> > 0 483
> > 1 701
> >
> > 32-byte (256b) accesses:
> > 0 256
> > 1 468
> >
> > Misalignment penalty (of streaming add):
> > 8-byte - 1.10x
> > 16-byte - 1.45x
> > 32-byte - 1.83x
Thank you!
> I think you should point out whether the accesses cross a cache line or not.
Interesting idea. With 8-byte accesses misaligned by one byte, roughly 12.5% of accesses should cross a cacheline boundary. The ratio goes up with access size, as does the misalignment penalty. The numbers don't quite match up, but a lot of the measurements could be explained if performance was limited by the number of cachelines read.
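8-byte, offset 0: 1064 µs - 1.77 cachelines / cycle
8-byte, offset 1: 1170 µs - 1.81 cachelines / cycle
16-byte, offset 0: 483 µs - 1.95 cachelines / cycle
16-byte, offset 1: 701 µs - 1.67 cachelines / cycle
32-byte, offset 0: 256 µs - 1.83 cachelines / cycle
32-byte, offset 1: 468 µs - 1.51 cachelines / cycle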
Not sure. I'll have to play around with the code a bit.
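For anyone who wants to check the arithmetic, here is a rough sketch of how the cachelines / cycle figures can be derived from the quoted timings. It rests on assumptions that are not spelled out above: 64-byte cachelines, the first column of the results being the byte offset from alignment, and a load that straddles a line boundary counting as two cacheline reads. The function names are made up for illustration.

```python
# Sketch only: the 64-byte line size, the offset interpretation and the
# "a crossing load counts as two line reads" model are assumptions.
CACHELINE = 64                 # bytes per cacheline (assumed)
CLOCK_GHZ = 4.25               # clock from the quoted post
TOTAL_BYTES = 16_000 * 4_000   # 16,000 B buffer, summed 4,000 times

def crossing_fraction(width, offset, line=CACHELINE):
    """Fraction of back-to-back width-byte loads, starting at `offset`,
    whose first and last byte fall in different cachelines."""
    starts = range(offset, offset + line, width)
    crossing = sum(1 for s in starts if s // line != (s + width - 1) // line)
    return crossing / len(starts)

def line_reads_per_cycle(width, offset, micros):
    """Cacheline reads per cycle for one (width, offset, time) measurement."""
    loads = TOTAL_BYTES / width
    line_reads = loads * (1 + crossing_fraction(width, offset))
    cycles = micros * CLOCK_GHZ * 1_000   # microseconds * cycles per microsecond
    return line_reads / cycles

# (access width in bytes, offset in bytes, time in microseconds) from the quote
measurements = [
    (8, 0, 1064), (8, 1, 1170),
    (16, 0, 483), (16, 1, 701),
    (32, 0, 256), (32, 1, 468),
]

for width, offset, micros in measurements:
    frac = crossing_fraction(width, offset)
    rate = line_reads_per_cycle(width, offset, micros)
    print(f"{width:2d}-byte, offset {offset}: {frac:5.1%} crossing, "
          f"{rate:.2f} cacheline reads / cycle")
```

Under those assumptions the crossing fractions come out as 12.5%, 25% and 50% for 8-, 16- and 32-byte accesses at offset 1, and the rates land within 0.01 of the figures listed above.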