By: Doug S (foo.delete@this.bar.bar), May 16, 2022 12:59 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on May 15, 2022 1:01 pm wrote:
> Doug S (foo.delete@this.bar.bar) on May 15, 2022 10:50 am wrote:
> > Michael S (already5chosen.delete@this.yahoo.com) on May 14, 2022 12:27 pm wrote:
> > > Simon Farnsworth (simon.delete@this.farnz.org.uk) on May 14, 2022 5:20 am wrote:
> > > > SVE2's big advantage over NEON is not wider vectors, but all the compiler-convenience features
> > > > it has that allow a compiler to be more aggressive about auto-vectorization. For people who
> > > > are hand-tuning codes for peak performance, SVE2 at 128 bit and NEON are about the same, but
> > > > SVE2 pulls ahead handily (due to the FFR register and associated instructions) when you're
> > > > writing "serial" code and relying on the compiler doing something sensible to it.
> > > >
> > > > You won't get the same performance this way as you would tuning
> > > > your code for 128 bit vectors, but it's still a win.
> > > >
> > >
> > > After looking at the code generated by the LLVM autovectorizer last year, I am more than
> > > somewhat doubtful. To say that a year ago it was bad would be an undeserved compliment.
> >
> >
> > From what I understand from someone who writes this type
> > of code (he's x86 focused so AVX not SVE or NEON) he
> > has to format his code just so to allow it to be properly
> > autovectorized. He learned by trial and error, checking
> > assembly output to figure out what the compiler expects and
> > writing his code to match. When the compiler is updated,
> > he has to recheck to verify his carefully crafted code sequences still produce the desired effect.
> >
> > Sounds like it is better than writing directly in assembly, but not by much. And I doubt
> > most programmers go to such lengths. Most probably write code that could be auto-vectorized
> > but is not, and they don't even know there is a lot of performance left on the table.
>
> The problems I had seen with LLVM/clang last year were MUCH worse than a mere "it didn't vectorize
> a potentially vectorizable loop". The more common scenario was that the compiler did vectorize a loop, but the
> resulting code was slower (and much, much, much ... much, much bigger) than the scalar variant.
> It happens with gcc too, but not nearly as often.
I don't know offhand what compiler he is using, but he develops almost entirely for Windows, so probably Microsoft's.