By: Michael S (already5chosen.delete@this.yahoo.com), September 18, 2021 3:58 pm
Room: Moderated Discussions
Jörn Engel (joern.delete@this.purestorage.com) on September 18, 2021 2:01 pm wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on September 17, 2021 7:48 am wrote:
> > Jörn Engel (joern.delete@this.purestorage.com) on September 17, 2021 5:42 am wrote:
> > >
> > > It won't. Unaligned access is a solved problem on any CPU
> > > that cares about performance. On Intel the difference
> > > between vmovdqu and vmovdqa on aligned data is zero - both
> > > instructions are equally fast. vmovdqu on unaligned
> > > data is maybe 10% slower than on aligned data, not a big deal either.
> >
> > 256-bit form is only 10% slower for L1D hit? May be, taken individually. But in the tight
> > loop, like in memcpy or integer variant of Stream Add, I'd expect it to be ~1.5x slower.
>
> Care to test your expectation? I tend to trust empirical results more than human expectations.
>
> Independent reproduction of my results:
> https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/
For 32-bit and 64-bit data elements I'd also expect a small penalty. On today's CPUs, for 32-bit elements I'd expect *less* than 10%. Not so for 256-bit elements.
As for doing my own microbenchmark: maybe tomorrow.
It's night here now.
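
Not a benchmark, just a sketch of the arithmetic behind that expectation. It counts how many back-to-back accesses straddle a cache-line boundary when the stream is misaligned, assuming 64-byte cache lines. The function name `split_fraction` and its parameters are made up for illustration, and it says nothing about how many cycles a split actually costs on a given core.

```python
def split_fraction(access_bytes, misalign, line_bytes=64, n_accesses=1024):
    """Fraction of consecutive accesses that cross a cache-line boundary.

    Assumption: 64-byte lines by default; accesses are contiguous,
    starting at byte offset `misalign`.
    """
    splits = 0
    for i in range(n_accesses):
        start = misalign + i * access_bytes
        end = start + access_bytes - 1  # last byte touched by this access
        if start // line_bytes != end // line_bytes:
            splits += 1
    return splits / n_accesses


if __name__ == "__main__":
    # 4 and 8 bytes stand in for 32-bit and 64-bit elements,
    # 32 bytes for a 256-bit vector load.
    for size in (4, 8, 32):
        print(size, split_fraction(size, 0), split_fraction(size, 1))
```

Under these assumptions a misaligned stream of 32-bit elements splits a line on 1 access in 16, while a misaligned stream of 256-bit loads splits on every second access. That is why I would not extrapolate from the small-element results to the 256-bit case.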