Interesting ARM compiler data

By: --- (---.delete@this.redheron.com), August 9, 2022 10:15 am
Room: Moderated Discussions
noko (noko.delete@this.noko.com) on August 8, 2022 9:30 pm wrote:
> --- (---.delete@this.redheron.com) on August 8, 2022 2:54 pm wrote:
> > I found a lot interesting in this update:
> >
> > https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/arm-compiler-for-linux-and-arm-performance-libraries-22-0
> >
> > On the plus side, it's clear that there still remains some low-hanging fruit for the ARM ecosystem
> > when compared to the x86 ecosystem, even in fairly basic things like special functions and BLAS.
> >
> > On the negative side, the SVE results seem disappointing.
> > One can spin this in a few different ways (isolated
> > loops won't show the size improvements from simpler loops without head and tails, as opposed to real large
> > apps using large shared libraries; these are already vector-dense loops, whereas SVE should help more code
> > that's less trivially vectorizable), and clearly the SVE optimization has only just begun.
> > Still, somewhat disappointing results. (Except of course that himeno number. Given how
> > much Phoronix pushes himeno, I look forward to seeing Michael try to justify this!)
>
> If anything, I think it shows the opposite - himeno has a large, easily vectorizable inner loop with contiguous
> accesses. But looking at the autovectorizer output, it's doing 1.4 loads per vector op. Without actually
> running it on a Neoverse-V1, I'd hazard a guess that the NEON version is saturating the 5-wide decode. So
> it's a perfect use case for wider ALUs (and more load bandwidth), rather than fancy instructions.
>
> The only fancy instruction the SVE version really benefits from is reg+reg addressing
> in 256-bit wide contiguous vector loads; NEON's LDP only supports immediate offsets.
>
> > Perhaps hoping for better loops in generic code requires
> > SVE2 and we can't hope for much of that with just SVE?
>
> Which instructions are you thinking of? I don't see much in
> SVE2 that would be very useful to autovectorization / SPMD.

https://developer.arm.com/documentation/102340/0001/Introducing-SVE2
https://developer.arm.com/documentation/102340/0001/New-features-in-SVE2

I'm hoping that SVE2's closer match with "full" NEON will let compilers more easily slot it into moderately complex integer loops, not just the primarily simple FP ones.
That should get us more immediate bang, before the harder pieces like predication and scatter/gather are fully integrated into the compilers and can be exploited in genuinely complex loops.
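
To make the predication point concrete, here's roughly the shape of loop I have in mind, written with the ACLE SVE intrinsics (the function name and the int32 add are just my illustration, not anything from the linked post; a compiler auto-vectorizing for SVE would produce essentially this structure). The whilelt predicate masks off the lanes past n on the last iteration, so there's no scalar drain loop the way a fixed-width NEON version needs one:

#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: element-wise add of two int32 arrays.
 * The predicate from svwhilelt covers the final partial vector,
 * so no separate head/tail code is needed. The vector length is
 * whatever the hardware provides (256 bits on Neoverse-V1). */
void add_i32(int32_t *restrict c, const int32_t *restrict a,
             const int32_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32_u64(i, n);   /* active lanes: i..n-1 */
        svint32_t va = svld1_s32(pg, &a[i]);     /* predicated loads */
        svint32_t vb = svld1_s32(pg, &b[i]);
        svst1_s32(pg, &c[i], svadd_s32_x(pg, va, vb));
    }
}

The equivalent NEON loop has to process the data in chunks of four and then fall back to scalar code for the remainder, which is exactly the head/tail overhead mentioned above; the open question is how long it takes the autovectorizers to lean on this for loops that are messier than a plain array add.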

We'll see. The consensus seems to be that what we're seeing is as much a limitation of the V1/Graviton3 as of SVE per se (i.e., much like Graviton1, you can view the SVE part of the design primarily as a test bed for developers, not as something yet ready to compete with the big boys).
Which, to be fair, is a smart, sensible way to proceed. Amazon has done a reasonable job of not hyping SVE as the next coming of the Beatles; maybe with Graviton4 they won't feel they need to be as restrained?