By: dmcq (dmcq.delete@this.fano.co.uk), September 24, 2021 1:05 pm
Room: Moderated Discussions
Andrey (andrey.semashev.delete@this.gmail.com) on September 24, 2021 8:29 am wrote:
> -.- (blarg.delete@this.mailinator.com) on September 23, 2021 8:01 pm wrote:
> > Andrey (andrey.semashev.delete@this.gmail.com) on September 23, 2021 10:11 am wrote:
> > > SVE is a spec, it simply allows a variety of implementations. It doesn't mean each and
> > > every one of them, down to pathological ones, will be implemented. But the code written
> > > for SVE will work, even in pathological cases. My point above was that there's no reason
> > > to worry about pathological cases because they will actually not exist in hardware.
> >
> > If they will never exist in hardware, why even allow it in the spec?
>
> There may be valid reasons. The spec has to strike a balance between being overly restrictive and
> overly permissive. And it has a limited scope. As I said before, vector size is not chosen in isolation,
> it should be aligned with other subsystems, which SVE doesn't define. Should SVE prohibit 136-bit
> vectors? No, because your cache line and page sizes may be a multiple of 17. In such an implementation,
> 136-bit vectors would be a reasonable choice. But in the power-of-2 world we live in, 136-bit vectors
> are nonsensical (possibly, aside from some very special purpose hardware).
>
> > > No, I'm not suggesting to hardcode a set of supported widths. However, the code could take into
> > > account the native vector length, including to achieve alignment - if this proves to be useful for
> > > performance on a given set of hardware implementations. As to how you would do this - there are
> > > many ways, starting from testing for a given set of CPU models that are known to benefit from this,
> > > and ending with all sorts of heuristics (e.g. check if the cache line size is a multiple of the
> > > native vector size).
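To make that last heuristic concrete, here is a minimal sketch in C; it assumes Linux with glibc's _SC_LEVEL1_DCACHE_LINESIZE sysconf extension, and uses the SVE ACLE intrinsic svcntb() to read the native vector length in bytes at run time:

    #include <arm_sve.h>    /* SVE ACLE intrinsics; compile with -march=armv8-a+sve */
    #include <stdbool.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Heuristic sketch: is the L1 data cache line size a multiple of the
       native SVE vector length? svcntb() returns the vector length in bytes. */
    static bool cacheline_multiple_of_vl(void)
    {
        long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        uint64_t vl = svcntb();
        return line > 0 && (uint64_t)line % vl == 0;
    }

On any sane power-of-2 implementation this returns true; code could fall back to the plain unaligned path when it doesn't.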
> >
> > That sounds very much like hard-coding for specific cases - whether you base it purely from vector
> > width, or take other aspects into consideration. Which I think is fine, but it isn't and doesn't
> > really suggest a generic approach - i.e. to use your example, what to do if the cache line size is
> > not a multiple of the native vector size (especially when no such hardware currently exists)?
>
> By default - nothing. I mean, so long as you're targeting an unknown abstract implementation,
> you may as well forget about alignment, instruction choices and scheduling and other micro-optimizations
> and just write the unaligned loop that maintains correctness.
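For reference, that unaligned, vector-length-agnostic loop is straightforward with SVE ACLE intrinsics; a minimal sketch (the function name and signature are mine, purely illustrative):

    #include <arm_sve.h>    /* compile with -march=armv8-a+sve */
    #include <stdint.h>

    /* dst[i] = a[i] + b[i]: no alignment assumptions, no hard-coded vector
       width; the predicate from svwhilelt_b32 masks off the tail lanes. */
    void add_f32(float *dst, const float *a, const float *b, int64_t n)
    {
        for (int64_t i = 0; i < n; i += svcntw()) {     /* svcntw() = 32-bit lanes */
            svbool_t pg = svwhilelt_b32(i, n);          /* active lanes this pass */
            svfloat32_t va = svld1(pg, a + i);          /* unaligned loads are fine */
            svfloat32_t vb = svld1(pg, b + i);
            svst1(pg, dst + i, svadd_x(pg, va, vb));
        }
    }

It is correct for any vector length the spec allows, 136-bit pathologies included.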
>
> When you don't know the hardware, but can reasonably guess its properties (e.g. because
> literally every CPU on the planet behaves the way you expect), you can apply the optimizations
> that are considered generally useful, like aligning your memory accesses.
>
> When the hardware is weird, and you want to optimize for it, then you have to get to know it. You
> would read its optimization manuals and benchmark to see what makes it tick, and choose your optimizations
> accordingly. This includes weird cases like cache line not being a multiple of the native vector
> size - in this case you would test how the hardware behaves wrt. data alignment.
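A crude sketch of such a test on Linux; the buffer size and repeat count are arbitrary placeholders, and a real benchmark would pin the thread to one core and discard warm-up runs:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Copy a buffer starting at each byte offset within a cache line and
       compare timings; slow offsets reveal misalignment penalties. */
    int main(void)
    {
        enum { N = 1 << 20, LINE = 64, REPS = 200 };
        unsigned char *src = aligned_alloc(LINE, N + LINE);
        unsigned char *dst = aligned_alloc(LINE, N + LINE);
        memset(src, 1, N + LINE);
        for (int off = 0; off < LINE; off++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int rep = 0; rep < REPS; rep++)
                memcpy(dst, src + off, N);      /* misaligned source when off != 0 */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double ms = (t1.tv_sec - t0.tv_sec) * 1e3
                      + (t1.tv_nsec - t0.tv_nsec) / 1e6;
            printf("source offset %2d: %.1f ms\n", off, ms);
        }
        free(src);
        free(dst);
        return 0;
    }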
>
> > > I mean, take x86 for example. There were times when unaligned vector memory accesses were so slow that
> > > programmers had to go an extra mile to just avoid movdqu.
> > > And they still do in performance-critical routines.
> > > There were CPUs that were very slow with certain instructions (e.g. pshufb in some Atoms and older AMD
> > > processors), but still supported them. Programs using those instructions were still correct, and possibly
> > > even faster than their scalar equivalents, but optimizing the code to account for such hardware
> > > quirks still improved performance, and was therefore considered a useful thing to do.
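For younger readers, the "extra mile" usually meant peeling scalar iterations until the pointer was 16-byte aligned, so the hot loop could use aligned loads and stores instead of movdqu. A sketch with SSE2 intrinsics (the function itself is invented for illustration; an in-place operation sidesteps the problem of two pointers with different misalignments):

    #include <emmintrin.h>   /* SSE2 */
    #include <stddef.h>
    #include <stdint.h>

    /* Bitwise-NOT a byte buffer in place. */
    void negate_bytes(uint8_t *p, size_t n)
    {
        size_t i = 0;
        while (i < n && ((uintptr_t)(p + i) & 15) != 0) {  /* peel to alignment */
            p[i] = (uint8_t)~p[i];
            i++;
        }
        const __m128i ones = _mm_set1_epi8(-1);
        for (; i + 16 <= n; i += 16) {                     /* aligned body: movdqa */
            __m128i v = _mm_load_si128((const __m128i *)(p + i));
            _mm_store_si128((__m128i *)(p + i), _mm_xor_si128(v, ones));
        }
        for (; i < n; i++)                                 /* scalar tail */
            p[i] = (uint8_t)~p[i];
    }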
> >
> > x86 has the benefit of CPUID being able to disclose the CPU model.
> > ARM doesn't really have any nice way of determining
> > the model. And even if there were, heterogeneous core setups
> > are much more common on ARM, often with substantially
> > different core types, which complicates code which specifically tries to target one uarch.
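For what it's worth, Linux does expose each core's model register through sysfs, though it's hardly as convenient as CPUID and is per core on heterogeneous systems. A sketch, assuming a mainline arm64 kernel and the sysfs path it creates:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Read MIDR_EL1 for a given core; the file holds a hex value such as
       0x410fd0c0. Returns 0 if the file is missing or unreadable. */
    static uint64_t read_midr(int cpu)
    {
        char path[128];
        uint64_t midr = 0;
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/regs/identification/midr_el1", cpu);
        FILE *f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%" SCNx64, &midr) != 1)
                midr = 0;
            fclose(f);
        }
        return midr;
    }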
> >
> > You can of course still try this type of optimisation with ARM, but, along with the diversity in the
> > ARM ecosystem, the general impression I get is that trying these sorts of optimisations (along with
> > trying to attain alignment) is much more difficult to pull off and often just isn't worth it.
>
> Yes, heterogeneous cores are a pain, and it looks like x86 will follow suit in the near future. But
> optimization is a two-way road. As much as software wants to run fast on existing hardware, the future
> hardware wants to run the existing software as fast as possible, too. So, having a CPU with half of
> the cores having 128-bit vectors and the other half having 136-bit ones would be a terrible idea because
> it would perform worse on the existing code (more precisely, half of the cores would perform worse).
> There is no sense in making such a CPU as opposed to a more traditional design with power-of-2 sizes.
> That is unless there is a very strong reason to have specifically 136-bit vectors, and even then a sane
> CPU designer would do everything possible to make sure 128-bit vectors still work fast.
>
> So, while different cores may be different, I would not expect them to be incompatibly different. Yes, I'm
> aware of some ARM CPUs that have different cache line sizes, but (a) they are multiples of one another (i.e.
> not quite incompatible) and (b) that already proved to be a PITA, so probably was not such a good idea.
>
> Side note about hybrid designs where different cores are radically different, like x86+ARM that AMD did.
> That sort of combination is a somewhat different story. The cores in such a hybrid are inherently incompatible,
> but that's not a problem. A thread that is running on one kind of core could never run on the other kind
> of core because of the different instruction sets. But even then, both kinds of cores still interface with the
> same system memory and IO, possibly via the same hardware blocks in the CPU, so they are not completely unrelated.
> So, if x86 cores follow the power-of-2 design, ARM cores will most likely have to as well.
SVE is in multiples of 128 bits, so it's not so bad! I'd guess the first heterogeneous system with a vector size greater than 128 bits will be an Apple one, and I'd guess they'll go for having the same size in both core types; perhaps they'll share an SVE unit amongst the small cores like ARM does. But they haven't even announced a system with SVE yet.