By: Michael S (already5chosen.delete@this.yahoo.com), May 14, 2022 12:27 pm
Room: Moderated Discussions
Simon Farnsworth (simon.delete@this.farnz.org.uk) on May 14, 2022 5:20 am wrote:
> Doug S (foo.delete@this.bar.bar) on May 13, 2022 9:28 pm wrote:
> > --- (---.delete@this.redheron.com) on May 13, 2022 2:05 pm wrote:
> > > Doug S (foo.delete@this.bar.bar) on May 13, 2022 10:22 am wrote:
> > > > --- (---.delete@this.redheron.com) on May 13, 2022 9:33 am wrote:
> > > > > NEON (built into each core) uses 128b registers, and there are 4 NEON engines, so (in a handwaving
> > > > > sense) 512b of NEON capability per cycle per P core, 256b of NEON capability per cycle per E core.
> > > > > SVE2 is *probably* coming to Apple this year with A16 and M2, and will *probably*
> > > > > feature 256b wide registers. But if you are evaluating SVE2 based on register
> > > > > width, you're misunderstanding where the value of SVE2 lies.
> > > >
> > > >
> > > > I'm curious why you expect to see SVE2 added to Apple's cores?
> > > > What would be the benefit, when there is already
> > > > NEON and AMX? Or to reverse the question, what do you think Apple would be losing by not having SVE2?
> > > >
> > > > If ARM plans to eventually make SVE2 a required part of a future iteration of ARMv9 rather
> > > > than optional, it would make sense to add it sooner rather than later. That goes double
> > > > if ARM plans to someday deprecate NEON and make it optional in the future.
> > > >
> > > > If SVE2 is going to remain optional for the foreseeable future, and NEON mandatory,
> > > > it seems like Apple would be better off putting their resources into AMX.
> > >
> > > As I have said a million times (but no-one pays any attention), SVE2 is not about wide
> > > vectors; it is about being a better compiler target for "general" loops.
> > > It will make Apple's CPUs faster, because it will allow more code to be vectorized,
> > > and vectorized at lower overhead, and that's why they will add it.
> > >
> > > AMX and SVE2 solve very different problems -- as I said in the first post.
> >
> >
> > What about NEON, which solves the exact same problem? SVE2 allows wider vectors, but unless
> > you actually ship with significantly wider vectors what's the difference between e.g. 2x256b
> > SVE2 and 4x128b NEON? Yeah SVE2 is where the development is taking place now, but most
> > of the new instructions are 'AI' related stuff Apple supports via the NPU.
> >
>
> SVE2's big advantage over NEON is not wider vectors, but all the compiler-convenience features
> it has that allow a compiler to be more aggressive about auto-vectorization. For people who
> are hand-tuning codes for peak performance, SVE2 at 128 bit and NEON are about the same, but
> SVE2 pulls ahead handily (due to the FFR register and associated instructions) when you're
> writing "serial" code and relying on the compiler doing something sensible to it.
>
> You won't get the same performance this way as you would tuning
> your code for 128 bit vectors, but it's still a win.
>
After looking at the code generated by the LLVM autovectorizer last year, I am more than a little doubtful. To say that a year ago the results were bad would be an undeserved compliment.
"Horrendously stupid" describes the situation more adequately.
> > Now if they plan to offer 512 bit wide SVE2 on M2 Max for the higher end stuff while keeping it at
> > a more reasonable 128 or 256 bit width for phones, tablets and lower end Macs maybe it makes sense.
>
>
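
As an aside on the width question quoted above: the supposed selling point is that SVE2 code is vector-length agnostic, so the same binary would run on a 128b phone core and a hypothetical 256b or 512b Mac core. A minimal hand-written sketch of what such code looks like with ACLE intrinsics (my own illustration, not anything from Apple or ARM):

    #include <arm_sve.h>
    #include <stdint.h>

    /* y[i] += a * x[i].  Vector-length agnostic: the same object code runs
     * whether the hardware implements 128-, 256- or 512-bit SVE vectors,
     * and the predicate covers the tail, so there is no scalar remainder loop. */
    void saxpy_sve(int64_t n, float a, const float *x, float *y)
    {
        for (int64_t i = 0; i < n; i += svcntw()) {     /* svcntw() = 32-bit lanes per vector */
            svbool_t    pg = svwhilelt_b32_s64(i, n);   /* active lanes for this iteration    */
            svfloat32_t vx = svld1_f32(pg, x + i);
            svfloat32_t vy = svld1_f32(pg, y + i);
            vy = svmla_n_f32_x(pg, vy, vx, a);          /* vy += vx * a in the active lanes   */
            svst1_f32(pg, y + i, vy);                   /* predicated store, tail-safe        */
        }
    }

Whether the per-iteration predicate overhead is really negligible on a 128b implementation is a separate question.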