By: Jörn Engel (joern.delete@this.purestorage.com), September 17, 2021 5:42 am
Room: Moderated Discussions
Brett (ggtgp.delete@this.yahoo.com) on September 16, 2021 11:32 pm wrote:
>
> Intel will finally be forced to add a real memcpy for aligned data.
It won't. Unaligned access is a solved problem on any CPU that cares about performance. On Intel, vmovdqu and vmovdqa are equally fast on aligned data; the difference is zero. vmovdqu on unaligned data is maybe 10% slower than on aligned data, which is not a big deal either. The description of A64FX reads as if the designers assumed 100% of memory accesses would be unaligned and would require two cachelines instead of one.
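To put a rough number on "unaligned and require two cachelines": a minimal C sketch that counts how many byte offsets within a cacheline make an access of a given size straddle two lines. The helper names `crosses_cacheline` and `count_split_offsets` are made up for illustration, and the 64-byte line size used below is an assumption (typical for current x86 parts, not taken from the A64FX description).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if the access [addr, addr + size) touches
 * two cachelines. Assumes line_size is a power of two and
 * 1 <= size <= line_size. */
static bool crosses_cacheline(uintptr_t addr, size_t size, size_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    assert(size >= 1 && size <= line_size);

    uintptr_t mask  = ~(uintptr_t)(line_size - 1);
    uintptr_t first = addr & mask;               /* line holding the first byte */
    uintptr_t last  = (addr + size - 1) & mask;  /* line holding the last byte  */
    return first != last;
}

/* Hypothetical helper: number of byte offsets within one cacheline at
 * which an access of `size` bytes would be split across two lines. */
static size_t count_split_offsets(size_t size, size_t line_size)
{
    size_t splits = 0;
    for (size_t off = 0; off < line_size; off++)
        splits += crosses_cacheline(off, size, line_size);
    return splits;
}
```

With these assumptions, `count_split_offsets(32, 64)` is 31: even a 32-byte load at a uniformly random byte offset stays inside one line a bit more than half the time, and a naturally aligned one never splits.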
Adding a "no funny business" variant of the memcpy instruction may make sense for things like overlapping source/destination, absolutely. Removing all the checks for special cases is a big deal. But unaligned data is no longer a special case to check for.
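As an illustration of the kind of special-case check meant here, a sketch of an overlap test that a general-purpose copy has to consider and a "no funny business" variant could skip. `regions_overlap` is a hypothetical helper, not anything from an actual memcpy implementation, and comparing addresses through `uintptr_t` assumes a flat address space.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if the n-byte regions starting at dst and
 * src share at least one byte. Assumes a flat address space so that
 * pointers can be compared as integers. */
static bool regions_overlap(const void *dst, const void *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;

    /* The regions overlap when the distance between the two start
     * addresses is smaller than the length being copied. */
    return d < s ? (s - d) < n : (d - s) < n;
}
```

A caller that can guarantee this returns false for its buffers is exactly the case where the check is wasted work.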