By: Michael S (already5chosen.delete@this.yahoo.com), May 15, 2022 1:01 pm
Room: Moderated Discussions
Doug S (foo.delete@this.bar.bar) on May 15, 2022 10:50 am wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on May 14, 2022 12:27 pm wrote:
> > Simon Farnsworth (simon.delete@this.farnz.org.uk) on May 14, 2022 5:20 am wrote:
> > > SVE2's big advantage over NEON is not wider vectors, but all the compiler-convenience features
> > > it has that allow a compiler to be more aggressive about auto-vectorization. For people who
> > > are hand-tuning codes for peak performance, SVE2 at 128 bit and NEON are about the same, but
> > > SVE2 pulls ahead handily (due to the FFR register and associated instructions) when you're
> > > writing "serial" code and relying on the compiler doing something sensible to it.
> > >
> > > You won't get the same performance this way as you would tuning
> > > your code for 128 bit vectors, but it's still a win.
> > >
> >
> > After looking at the code generated by the LLVM autovectorizer last year, I am more than somewhat
> > doubtful. To say that a year ago it was bad would be an undeserved compliment.
>
>
> From what I understand from someone who writes this type of code (he's x86 focused so AVX not SVE or NEON) he
> has to format his code just so to allow it to be properly autovectorized. He learned by trial and error, checking
> assembly output to figure out what the compiler expects and write his code to match. When the compiler is updated,
> he has to recheck to verify his carefully crafted code sequences still produce the desired effect.
>
> Sounds like it is better than writing directly in assembly, but not by much. And I doubt
> most programmers go to such lengths. Most probably write code that could be auto vectorized
> but is not, and they don't even know there is a lot of performance left on the table.
The problems I saw with LLVM/clang last year were MUCH worse than a mere "it didn't vectorize a potentially vectorizable loop". The more common scenario was that the compiler did vectorize a loop, but the resulting code was slower (and much, much, much ... much, much bigger) than the scalar variant.
It happens with gcc too, but not nearly as often.