By: Anon (no.delete@this.spam.com), September 19, 2021 5:32 pm
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on September 19, 2021 4:46 pm wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on September 18, 2021 3:58 pm wrote:
> > Jörn Engel (joern.delete@this.purestorage.com) on September 18, 2021 2:01 pm wrote:
> > > Michael S (already5chosen.delete@this.yahoo.com) on September 17, 2021 7:48 am wrote:
> > > > Jörn Engel (joern.delete@this.purestorage.com) on September 17, 2021 5:42 am wrote:
> > > > >
> > > > > It won't. Unaligned access is a solved problem on any CPU
> > > > > that cares about performance. On Intel the difference
> > > > > between vmovdqu and vmovdqa on aligned data is zero - both
> > > > > instructions are equally fast. vmovdqu on unaligned
> > > > > data is maybe 10% slower than on aligned data, not a big deal either.
> > > >
> > > > The 256-bit form is only 10% slower for an L1D hit? Maybe, taken individually. But in a tight
> > > > loop, like memcpy or an integer variant of Stream Add, I'd expect it to be ~1.5x slower.
> > >
> > > Care to test your expectation? I tend to trust empirical results more than human expectations.
> > >
> > > Independent reproduction of my results:
> > > https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/
> >
> > For 32-bit and 64-bit data elements I'd also expect a small penalty. On today's CPUs,
> > for 32-bit elements I'd expect *less* than 10%. Not so for 256-bit elements.
> >
> > As for doing my own microbenchmark: maybe tomorrow.
> > It's night here now.
> >
>
> So, I measured the time, in microseconds, to sum 8,000,000 L1D-resident 64-bit numbers
> (a 16,000 B buffer, summed 4,000 times) at different alignments and using different
> access/arithmetic widths. CPU: Skylake Client (Xeon E-2176G) downclocked to 4.25 GHz.
>
> Here are the results (first column: start offset in bytes, second column: time in µs):
> 8-byte (64b) accesses:
> 0 1064
> 1 1170
> 2 1170
> 3 1170
> 4 1170
> 5 1170
> 6 1170
> 7 1170
> 8 1071
> 9 1171
> 10 1171
> 11 1171
> 12 1171
> 13 1171
> 14 1171
> 15 1171
> 16 1067
> 17 1170
> 18 1170
> 19 1170
> 20 1170
> 21 1170
> 22 1170
> 23 1170
> 24 1065
> 25 1170
> 26 1170
> 27 1170
> 28 1170
> 29 1170
> 30 1170
> 31 1170
>
> 16-byte (128b) accesses:
> 0 483
> 1 701
> 2 701
> 3 701
> 4 701
> 5 701
> 6 701
> 7 701
> 8 701
> 9 701
> 10 701
> 11 702
> 12 701
> 13 701
> 14 701
> 15 701
> 16 483
> 17 702
> 18 702
> 19 702
> 20 702
> 21 702
> 22 701
> 23 702
> 24 701
> 25 702
> 26 702
> 27 702
> 28 702
> 29 701
> 30 702
> 31 701
>
>
> 32-byte (256b) accesses:
> 0 256
> 1 468
> 2 468
> 3 468
> 4 468
> 5 468
> 6 468
> 7 468
> 8 468
> 9 468
> 10 468
> 11 468
> 12 468
> 13 468
> 14 468
> 15 468
> 16 468
> 17 468
> 18 468
> 19 468
> 20 468
> 21 468
> 22 468
> 23 468
> 24 468
> 25 468
> 26 468
> 27 468
> 28 468
> 29 468
> 30 468
> 31 468
>
> Misalignment penalty (of streaming add), misaligned vs. aligned time:
> 8-byte - 1.10x (1170/1064)
> 16-byte - 1.45x (701/483)
> 32-byte - 1.83x (468/256)
>
> So, on this particular CPU the penalty is even bigger than I expected.
> Quite possibly, on SKX with 512b accesses the penalty would be
> over 2x. Unfortunately, right now I have no access to SKX.
>
>
I think you should point out whether or not each access crosses a cache line.
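To expand on that: Skylake's cache lines are 64 bytes, so an access of width w starting at byte offset o straddles two lines exactly when (o & 63) + w > 64. If you stride through a buffer in w-byte steps from a misaligned start (and w divides 64), exactly one access in every 64/w crosses a line, and none cross when the start offset is a multiple of w. That matches your fast offsets exactly: 0/8/16/24 for 8-byte, 0/16 for 16-byte, only 0 for 32-byte. And the crossing fraction doubles at each width (1/8, 1/4, 1/2), which tracks the 1.10x / 1.45x / 1.83x penalties. A check along these lines (just a sketch; it assumes 64-byte lines):

    #include <stdint.h>
    #include <stddef.h>

    /* Does an access of 'width' bytes starting at 'addr' straddle
     * two cache lines? Assumes 64-byte lines, as on Skylake. */
    static inline int crosses_line(uintptr_t addr, size_t width)
    {
        return ((addr & 63) + width) > 64;
    }

For a stream of 32-byte loads at offset 1 (or any offset 1..31), every second load satisfies this predicate, which lines up with the flat 468 µs across all the misaligned offsets in your 256b table.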
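On the vmovdqu/vmovdqa point upthread: with intrinsics the two map to _mm256_loadu_si256 and _mm256_load_si256 respectively, and the only architectural difference is that the aligned form faults on a misaligned address. A minimal illustration (compile with -mavx2; the function names are mine):

    #include <immintrin.h>

    /* vmovdqa: p must be 32-byte aligned or the load #GP-faults. */
    static __m256i load_aligned(const void *p)
    {
        return _mm256_load_si256((const __m256i *)p);
    }

    /* vmovdqu: accepts any address; per Jörn's numbers it costs the
     * same as vmovdqa whenever p happens to be aligned. */
    static __m256i load_unaligned(const void *p)
    {
        return _mm256_loadu_si256((const __m256i *)p);
    }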
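And for anyone who wants to rerun this with line crossing in mind, here is roughly the shape of benchmark I assume Michael used - a sketch built from his stated parameters (16,000 B buffer, 4,000 passes, 32-byte loads), not his actual code. Compile with -mavx2; timing of the inner loop is left to the reader:

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sum a buffer of 64-bit integers using 32-byte (256b) loads.
     * 'base' may be offset by 0..31 bytes from a 64B-aligned block. */
    static uint64_t sum256(const uint8_t *base, size_t bytes)
    {
        __m256i acc = _mm256_setzero_si256();
        for (size_t i = 0; i + 32 <= bytes; i += 32)
            acc = _mm256_add_epi64(acc,
                  _mm256_loadu_si256((const __m256i *)(base + i)));
        uint64_t lane[4];
        _mm256_storeu_si256((__m256i *)lane, acc);
        return lane[0] + lane[1] + lane[2] + lane[3];
    }

    int main(void)
    {
        enum { BYTES = 16000, REPS = 4000 };  /* Michael's parameters */
        /* Over-allocate so the start can be shifted by 0..31 bytes. */
        uint8_t *raw = aligned_alloc(64, BYTES + 64);
        memset(raw, 1, BYTES + 64);

        for (int off = 0; off < 32; off++) {
            uint64_t sum = 0;
            /* Wrap this inner loop with clock_gettime() or rdtsc
             * to get the per-offset times. */
            for (int r = 0; r < REPS; r++)
                sum += sum256(raw + off, BYTES);
            /* Print the checksum so the work isn't optimized away. */
            printf("%2d  checksum %llu\n", off, (unsigned long long)sum);
        }
        free(raw);
        return 0;
    }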