By: Andrey (andrey.semashev.delete@this.gmail.com), September 24, 2021 8:29 am
Room: Moderated Discussions
-.- (blarg.delete@this.mailinator.com) on September 23, 2021 8:01 pm wrote:
> Andrey (andrey.semashev.delete@this.gmail.com) on September 23, 2021 10:11 am wrote:
> > SVE is a spec, it simply allows a variety of implementations. It doesn't mean each and
> > every one of them, down to pathological ones, will be implemented. But the code written
> > for SVE will work, even in pathological cases. My point above was that there's no reason
> > to worry about pathological cases because they will actually not exist in hardware.
>
> If they will never exist in hardware, why even allow it in the spec?
There may be valid reasons. The spec has to strike a balance between being overly restrictive and overly permissive. And it has a limited scope. As I said before, the vector size is not chosen in isolation; it should be aligned with other subsystems, which SVE doesn't define. Should SVE prohibit 136-bit vectors? No, because your cache line and page sizes may be multiples of 17 bytes, and in such an implementation, 136-bit (17-byte) vectors would be a reasonable choice. But in the power-of-2 world we live in, 136-bit vectors are nonsensical (aside, possibly, from some very special-purpose hardware).
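To make that concrete: vector-length-agnostic SVE code never hard-codes a width, which is why it keeps working whatever length an implementation picks. A minimal sketch against the ACLE intrinsics (the function name is mine, purely for illustration):

#include <arm_sve.h>
#include <stdint.h>

/* a[i] += b[i]. svcntw() and the whilelt predicate adapt at run time
   to whatever vector length the implementation actually has. */
void add_arrays(float *a, const float *b, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntw())
    {
        svbool_t pg = svwhilelt_b32_s64(i, n);   /* masks off the tail */
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, a + i, svadd_f32_x(pg, va, vb));
    }
}

The same binary would run unchanged on 128-bit, 256-bit or, hypothetically, 136-bit hardware; only the number of loop iterations changes.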
> > No, I'm not suggesting to hardcode a set of supported widths. However, the code could take into
> > account the native vector length, including to achieve alignment - if this proves to be useful for
> > performance on a given set of hardware implementations. As to how you would do this - there are
> > many ways, starting from testing for a given set of CPU models that are known to benefit from this,
> > and ending with all sorts of heuristics (e.g. check if the cache line size is a multiple of the
> > native vector size).
>
> That sounds very much like hard-coding for specific cases - whether you base it purely from vector
> width, or take other aspects into consideration. Which I think is fine, but it isn't and doesn't
> really suggest a generic approach - i.e. to use your example, what to do if the cache line size is
> not a multiple of the native vector size (especially when no such hardware currently exists)?
By default - nothing. I mean, so long as you're targeting an unknown, abstract implementation, you may as well forget about alignment, instruction selection, scheduling and other micro-optimizations, and just write the unaligned loop that maintains correctness.
When you don't know the hardware, but can reasonably guess its properties (e.g. because literally every CPU on the planet behaves the way you expect), you can apply the optimizations that are considered generally useful, like aligning your memory accesses.
When the hardware is weird and you want to optimize for it, you have to get to know it. You would read its optimization manuals and benchmark it to see what makes it tick, and choose your optimizations accordingly. This includes weird cases like the cache line size not being a multiple of the native vector size - in that case you would test how the hardware behaves with respect to data alignment.
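As a sketch of that kind of heuristic (assuming Linux/glibc for sysconf() and the SVE ACLE for svcntb(); the function name is made up for illustration):

#include <arm_sve.h>
#include <stdint.h>
#include <stdbool.h>
#include <unistd.h>

/* Only chase alignment when a cache line holds a whole number of
   native vectors; otherwise just use the plain unaligned loop. */
static bool alignment_worth_pursuing(void)
{
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE); /* glibc extension; 0 or -1 if unknown */
    uint64_t vl = svcntb();                          /* native vector length in bytes */
    if (line <= 0)
        return false;                                /* unknown - don't bother */
    return (uint64_t)line % vl == 0;
}

Whether the aligned path actually pays off is, of course, still something you'd confirm by benchmarking on the hardware in question.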
> > I mean, take x86 for example. There were times when unaligned vector memory accesses were so slow that
> > programmers had to go an extra mile to just avoid movdqu.
> > And they still do in performance critical routines.
> > There were CPUs that were very slow with certain instructions (e.g. pshufb in some Atoms and older AMD
> > processors), but still supported them. Programs using those instructions were still correct, and possibly
> > even faster than their scalar equivalents, but still optimizing the code in account for such hardware
> > quirks allowed to improve performance, and therefore considered a useful thing to do.
>
> x86 has the benefit of CPUID being able to disclose the CPU model. ARM doesn't really have any nice way of determining
> the model. And even if there was, heterogeneous core setups are much more common on ARM, often with substantially
> different core types, which complicates code which specifically tries to target one uarch.
>
> You can of course still try this type of optimisation with ARM, but, along with the diversity in the
> ARM ecosystem, the general impression I get is that trying these sorts of optimisations (along with
> trying to attain alignment) are much more difficult to pull off and often just isn't worth it.
Yes, heterogeneous cores are a pain, and it looks like x86 will follow suit in the near future. But optimization is a two-way road. As much as software wants to run fast on existing hardware, future hardware wants to run existing software as fast as possible, too. So a CPU with half of the cores having 128-bit vectors and the other half 136-bit ones would be a terrible idea, because it would perform worse on existing code (more precisely, half of the cores would perform worse). There is no sense in making such a CPU as opposed to a more traditional design with power-of-2 sizes. That is, unless there is a very strong reason to have specifically 136-bit vectors, and even then a sane CPU designer would do everything possible to make sure 128-bit vectors still work fast.
So, while different cores may be different, I would not expect them to be incompatibly different. Yes, I'm aware of some ARM CPUs whose cores have different cache line sizes, but (a) the sizes are multiples of one another (i.e. not quite incompatible) and (b) that has already proved to be a PITA, so it probably was not such a good idea.
A side note about hybrid designs where the cores are radically different, like the x86+ARM combination that AMD did: that sort of combination is a somewhat different story. The cores in such a hybrid are inherently incompatible, but that's not a problem - a thread running on one kind of core could never run on the other kind because of the different instruction sets. Even then, both kinds of cores still interface with the same system memory and IO, possibly via the same hardware blocks in the CPU, so they are not completely unrelated. So, if the x86 cores follow the power-of-2 design, the ARM cores will most likely have to as well.