By: Simon Farnsworth (simon.delete@this.farnz.org.uk), September 24, 2021 2:47 am
Room: Moderated Discussions
-.- (blarg.delete@this.mailinator.com) on September 23, 2021 8:01 pm wrote:
> Andrey (andrey.semashev.delete@this.gmail.com) on September 23, 2021 10:11 am wrote:
> > SVE is a spec, it simply allows a variety of implementations. It doesn't mean each and
> > every one of them, down to pathological ones, will be implemented. But the code written
> > for SVE will work, even in pathological cases. My point above was that there's no reason
> > to worry about pathological cases because they will actually not exist in hardware.
>
> If they will never exist in hardware, why even allow it in the spec?
>
> > No, I'm not suggesting to hardcode a set of supported widths. However, the code could take into
> > account the native vector length, including to achieve alignment - if this proves to be useful for
> > performance on a given set of hardware implementations. As to how you would do this - there are
> > many ways, starting from testing for a given set of CPU models that are known to benefit from this,
> > and ending with all sorts of heuristics (e.g. check if the cache line size is a multiple of the
> > native vector size).
>
> That sounds very much like hard-coding for specific cases - whether you base it purely from vector
> width, or take other aspects into consideration. Which I think is fine, but it isn't and doesn't
> really suggest a generic approach - i.e. to use your example, what to do if the cache line size is
> not a multiple of the native vector size (especially when no such hardware currently exists)?
>
> > I mean, take x86 for example. There were times when unaligned vector memory accesses were so slow that
> > programmers had to go an extra mile to just avoid movdqu.
> > And they still do in performance critical routines.
> > There were CPUs that were very slow with certain instructions (e.g. pshufb in some Atoms and older AMD
> > processors), but still supported them. Programs using those instructions were still correct, and possibly
> > even faster than their scalar equivalents, but still optimizing the code in account for such hardware
> > quirks allowed to improve performance, and therefore considered a useful thing to do.
>
> x86 has the benefit of CPUID being able to disclose the CPU model. ARM doesn't really have any nice way of determining
> the model. And even if there was, heterogeneous core setups are much more common on ARM, often with substantially
> different core types, which complicates code which specifically tries to target one uarch.
>
> You can of course still try this type of optimisation with ARM, but, along with the diversity in the
> ARM ecosystem, the general impression I get is that trying these sorts of optimisations (along with
> trying to attain alignment) are much more difficult to pull off and often just isn't worth it.
Out of interest, why does the MIDR_EL1 register, when combined with the REVIDR_EL1 register, not provide enough information to identify the model? These are architectural registers in ARMv8-A and, in theory, provide enough detail to identify not just the model but also the stepping of the chip you're running on.
The big difference from CPUID on x86 is that there's no string version, so you have to look up the name in an external lookup table.
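To make the lookup-table point concrete, here is a minimal sketch of decoding a raw MIDR_EL1 value and mapping it to a name. The field layout (implementer, variant, architecture, part number, revision) follows the ARMv8-A definition of the register; the names Midr, decode_midr, KNOWN_CORES and core_name are hypothetical, the table is deliberately tiny and illustrative, and the sample value is just an example input, not something taken from this thread.

```python
from typing import NamedTuple


class Midr(NamedTuple):
    """Fields of MIDR_EL1 as laid out in ARMv8-A."""
    implementer: int   # bits [31:24], e.g. 0x41 for Arm Ltd
    variant: int       # bits [23:20], major revision ("rN")
    architecture: int  # bits [19:16], 0xF means "see the ID registers"
    part_num: int      # bits [15:4], identifies the core design
    revision: int      # bits [3:0], minor revision ("pN")


def decode_midr(value: int) -> Midr:
    """Split a raw 32-bit MIDR_EL1 value into its fields."""
    return Midr(
        implementer=(value >> 24) & 0xFF,
        variant=(value >> 20) & 0xF,
        architecture=(value >> 16) & 0xF,
        part_num=(value >> 4) & 0xFFF,
        revision=value & 0xF,
    )


# Hypothetical, incomplete lookup table keyed by (implementer, part number).
# A real one would have to be maintained externally, since the register
# carries no name string.
KNOWN_CORES = {
    (0x41, 0xD03): "Cortex-A53",
    (0x41, 0xD08): "Cortex-A72",
    (0x41, 0xD0C): "Neoverse N1",
}


def core_name(value: int) -> str:
    """Return a human-readable core name, or "unknown" if not in the table."""
    midr = decode_midr(value)
    return KNOWN_CORES.get((midr.implementer, midr.part_num), "unknown")


if __name__ == "__main__":
    sample = 0x410FD083  # example input only
    print(decode_midr(sample), core_name(sample))
```

On Linux the raw value is, as far as I know, exposed per CPU under /sys/devices/system/cpu/cpuN/regs/identification/midr_el1, so on a heterogeneous system you would decode it once per core rather than once per machine.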