By: Andrey (andrey.semashev.delete@this.gmail.com), September 23, 2021 10:11 am
Room: Moderated Discussions
-.- (blarg.delete@this.mailinator.com) on September 22, 2021 5:56 pm wrote:
> Andrey (andrey.semashev.delete@this.gmail.com) on September 21, 2021 5:25 pm wrote:
> > I would think, such an implementation is unrealistic, unless the underlying memory transfer unit (e.g.
> > a cache line) is a multiple of 48 (for 384-bit vectors) or 80 (for 640-bit), or alignment is entirely
> > irrelevant (which is unrealistic in its own right). There is no sense in designing an actual hardware
> > where the vector size does not work well with other subsystems, memory subsystem in particular.
>
> Realistic hardware or not, SVE code must support such configurations to be fully spec compliant.
SVE is a spec; it allows a variety of implementations, but that doesn't mean every one of them, down to the pathological ones, will actually be built. Code written for SVE will still work, even in the pathological cases. My point above was that there's no reason to worry about pathological cases, because they simply won't exist in real hardware.
> You can, of course, check the vector length up front, and refuse to run on widths that aren't a power
> of 2 (or perhaps choose not to bother with alignment in such cases, or fall back to something else),
> though that does go against the mantra of SVE being arbitrarily "scalable" as defined by ARM.
No, I'm not suggesting hardcoding a set of supported widths. However, the code could take the native vector length into account, including to achieve alignment, if that proves useful for performance on a given set of hardware implementations. As to how you would do this, there are many ways, ranging from testing for specific CPU models known to benefit, to all sorts of heuristics (e.g. checking whether the cache line size is a multiple of the native vector size). Doing this does not negate the fact that SVE is scalable, as the code still works correctly on any vector size. It's just a potentially useful optimization.
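To make that heuristic concrete, here's a minimal sketch in C using the ACLE SVE intrinsics. The function name and the fallback cache line size are my own assumptions; svcntb() and sysconf() are just one way to query the two sizes (the latter is glibc/Linux-specific):

    #include <arm_sve.h>   /* ACLE SVE intrinsics (compile with +sve) */
    #include <unistd.h>    /* sysconf */
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical helper: decide whether peeling a prologue to reach
     * vector alignment is likely worthwhile, based on whether the cache
     * line size is a multiple of the native SVE vector length. */
    static bool alignment_peeling_worthwhile(void)
    {
        uint64_t vec_bytes = svcntb();  /* native vector length in bytes */
        long cache_line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        if (cache_line <= 0)
            cache_line = 64;            /* assumed fallback if sysconf fails */
        /* Power-of-2 vector length that divides the cache line: aligning
         * the pointer to vec_bytes then keeps full-vector accesses from
         * straddling cache lines. */
        return (vec_bytes & (vec_bytes - 1)) == 0 &&
               (uint64_t)cache_line % vec_bytes == 0;
    }

If the heuristic says yes, a loop could peel a predicated prologue until the pointer is aligned to svcntb() bytes; if not, it just runs the usual SVE loop with unaligned accesses.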
I mean, take x86 for example. There were times when unaligned vector memory accesses were so slow that programmers had to go the extra mile just to avoid movdqu, and they still do in performance-critical routines. There were CPUs that were very slow with certain instructions (e.g. pshufb on some Atoms and older AMD processors) but still supported them. Programs using those instructions were still correct, and possibly even faster than their scalar equivalents, yet optimizing the code to account for such hardware quirks improved performance further and was therefore considered a useful thing to do.
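For the movdqu case, the classic trick was to peel a scalar prologue until the pointer reaches 16-byte alignment and then use aligned loads (movdqa) in the main loop. A rough SSE2 sketch, purely illustrative (the function and its shape are hypothetical):

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stddef.h>
    #include <stdint.h>

    static uint32_t sum_u32(const uint32_t *p, size_t n)
    {
        uint32_t sum = 0;
        /* Scalar prologue: advance until p is 16-byte aligned. */
        while (n > 0 && ((uintptr_t)p & 15) != 0) {
            sum += *p++;
            --n;
        }
        __m128i acc = _mm_setzero_si128();
        /* Main loop: aligned loads only (movdqa). */
        for (; n >= 4; n -= 4, p += 4)
            acc = _mm_add_epi32(acc, _mm_load_si128((const __m128i *)p));
        /* Horizontal reduction of the vector accumulator. */
        acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
        acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
        sum += (uint32_t)_mm_cvtsi128_si32(acc);
        /* Scalar epilogue for the leftover elements. */
        while (n-- > 0)
            sum += *p++;
        return sum;
    }

The same peel-to-alignment idea carries over to SVE, except the prologue can be a single predicated partial vector instead of a scalar loop.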