By: Jörn Engel (joern.delete@this.purestorage.com), September 18, 2021 2:01 pm
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on September 17, 2021 7:48 am wrote:
> Jörn Engel (joern.delete@this.purestorage.com) on September 17, 2021 5:42 am wrote:
> >
> > It won't. Unaligned access is a solved problem on any CPU
> > that cares about performance. On Intel the difference
> > between vmovdqu and vmovdqa on aligned data is zero - both
> > instructions are equally fast. vmovdqu on unaligned
> > data is maybe 10% slower than on aligned data, not a big deal either.
>
> The 256-bit form is only 10% slower on an L1D hit? Maybe, taken individually. But in a tight
> loop, like memcpy or an integer variant of Stream Add, I'd expect it to be ~1.5x slower.
Care to test your expectation? I tend to trust empirical results more than human expectations.
Independent reproduction of my results:
https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/
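
For anyone who wants to check on their own machine, here is a minimal sketch in the spirit of the linked test. It is not the code from that post, and the function names, buffer size and repetition count are all made up for illustration. It sums 64-bit words from a 16 KiB buffer (assumed to fit in L1D) at byte offsets 0 through 7 and prints the CPU time for each offset. It uses portable memcpy loads rather than 256-bit vmovdqu, so it only approximates the case discussed above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Sum n 64-bit words starting at p, which may have any byte alignment.
 * memcpy is the portable way to express an unaligned load; compilers
 * typically lower it to a plain load instruction. */
uint64_t sum_words(const unsigned char *p, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t w;
        memcpy(&w, p + i * sizeof w, sizeof w);
        sum += w;
    }
    return sum;
}

/* Time `reps` passes over n words at byte offset `off` from base.
 * One byte is rewritten per pass so the compiler cannot hoist the sum
 * out of the loop. Returns CPU seconds; the checksum goes to *out. */
double time_offset(unsigned char *base, size_t off, size_t n, int reps,
                   uint64_t *out)
{
    uint64_t acc = 0;
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++) {
        base[off] = (unsigned char)r;
        acc += sum_words(base + off, n);
    }
    clock_t t1 = clock();
    *out = acc;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

/* Hypothetical driver: 2048 words = 16 KiB working set, offsets 0..7. */
int run_alignment_bench(void)
{
    enum { N = 2048, REPS = 20000 };
    size_t bytes = N * sizeof(uint64_t) + 64;   /* multiple of 64 */
    unsigned char *buf = aligned_alloc(64, bytes);
    if (!buf)
        return -1;
    assert(((uintptr_t)buf & 63) == 0);         /* base is cache-line aligned */
    memset(buf, 1, bytes);
    for (size_t off = 0; off < 8; off++) {
        uint64_t sum;
        double s = time_offset(buf, off, N, REPS, &sum);
        printf("offset %zu: %.3f s (checksum %llu)\n",
               off, s, (unsigned long long)sum);
    }
    free(buf);
    return 0;
}
```

Call run_alignment_bench() from main and build with optimization (e.g. -O2). With a 64-byte-aligned base, offsets 1 through 7 make one load in eight straddle a cache line, so any penalty should show up there. The numbers will vary by CPU and compiler, and the compiler may or may not vectorize the loop. To measure vmovdqu specifically, one would swap the memcpy load for _mm256_loadu_si256 from immintrin.h and build with AVX2 enabled.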