By: Michael S (already5chosen.delete@this.yahoo.com), September 17, 2021 7:48 am
Room: Moderated Discussions
Jörn Engel (joern.delete@this.purestorage.com) on September 17, 2021 5:42 am wrote:
> Brett (ggtgp.delete@this.yahoo.com) on September 16, 2021 11:32 pm wrote:
> >
> > Intel will finally be forced to add a real memcpy for aligned data.
>
> It won't. Unaligned access is a solved problem on any CPU that cares about performance. On Intel the difference
> between vmovdqu and vmovdqa on aligned data is zero - both instructions are equally fast. vmovdqu on unaligned
> data is maybe 10% slower than on aligned data, not a big deal either.
The 256-bit form is only 10% slower on an L1D hit? Maybe, taken individually. But in a tight loop, like memcpy or an integer variant of Stream Add, I'd expect it to be ~1.5x slower.
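One plausible way to arrive at a figure like ~1.5x is to count cache-line touches. This is a rough sketch under stated assumptions, not a measurement: it assumes 64-byte cache lines, 32-byte (256-bit) accesses walking a buffer back to back, and a loop limited by L1D access throughput, where an access that straddles two lines costs about as much as two accesses. The function names are hypothetical, and real hardware may behave differently.

```c
#include <stddef.h>

/* Hypothetical helper: count how many accesses of `width` bytes straddle a
 * cache-line boundary, when the first access starts at byte offset `start`
 * and each following access starts `width` bytes later. */
size_t count_line_splits(size_t start, size_t width, size_t line, size_t n_accesses)
{
    size_t splits = 0;
    for (size_t i = 0; i < n_accesses; i++) {
        size_t first = start + i * width;   /* first byte touched */
        size_t last  = first + width - 1;   /* last byte touched  */
        if (first / line != last / line)    /* different cache lines */
            splits++;
    }
    return splits;
}

/* Hypothetical cost model (an assumption): one line touch per access,
 * plus one extra touch for every access that splits across two lines. */
size_t count_line_touches(size_t start, size_t width, size_t line, size_t n_accesses)
{
    return n_accesses + count_line_splits(start, width, line, n_accesses);
}
```

With 32-byte accesses and 64-byte lines, any start offset that is not a multiple of 32 makes every second access straddle a line, so 1000 accesses cost 1500 line touches in this model instead of 1000. That ratio is the 1.5x; whether a given core actually pays it depends on how its load and store pipes handle splits.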
> The description of A64FX reads as if designers
> assumed 100% of memory accesses would be unaligned and require two cachelines instead of one.
>
> Adding a "no funny business" variant of the memcpy instruction may make sense for things
> like overlapping source/destination, absolutely. Removing all the checks for special
> cases is a big deal. But unaligned data is no longer a special case to check for.