Interesting ARM compiler data

By: noko (noko.delete@this.noko.com), August 9, 2022 11:34 am
Room: Moderated Discussions
--- (---.delete@this.redheron.com) on August 9, 2022 10:15 am wrote:
> noko (noko.delete@this.noko.com) on August 8, 2022 9:30 pm wrote:
> > --- (---.delete@this.redheron.com) on August 8, 2022 2:54 pm wrote:
> > > I found a lot interesting in this update:
> > >
> > > https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/arm-compiler-for-linux-and-arm-performance-libraries-22-0
> > >
> > > On the plus side, it's clear that there still remains some low-hanging fruit for the ARM ecosystem
> > > when compared to the x86 ecosystem, even in fairly basic things like special functions and BLAS.
> > >
> > > On the negative side, the SVE results seem disappointing.
> > > One can spin this in a few different ways (isolated
> > > loops won't show the size improvements from simpler loops without head and tails, as opposed to real large
> > > apps using large shared libraries; these are already vector-dense loops, whereas SVE should help more code
> > > that's less trivially vectorizable), and clearly the SVE optimization has only just begun.
> > > Still, somewhat disappointing results. (Except of course that himeno number. Given how
> > > much Phoronix pushes himeno, I look forward to seeing Michael try to justify this!)
> >
> > If anything I think it shows the opposite - himeno has a
> > large, easily vectorizable inner loop with contiguous
> > accesses. But looking at the autovectorizer output, it's doing 1.4 loads per vector op. Without actually
> > running it on a Neoverse-V1, I'd hazard a guess that the NEON version is saturating the 5-wide decode. So
> > it's a perfect use case for wider ALUs (and more load bandwidth), rather than fancy instructions.
> >
> > The only fancy instruction the SVE version really benefits from is reg+reg addressing
> > in 256-bit wide contiguous vector loads; NEON's LDP only supports immediate offsets.
> >
> > > Perhaps hoping for better loops in generic code requires
> > > SVE2 and we can't hope for much of that with just SVE?
> >
> > Which instructions are you thinking of? I don't see much in
> > SVE2 that would be very useful to autovectorization / SPMD.
>
> https://developer.arm.com/documentation/102340/0001/Introducing-SVE2
> https://developer.arm.com/documentation/102340/0001/New-features-in-SVE2
>
> I'm hoping that the closer match with "full" NEON will mean compilers will be able to more
> easily slot SVE2 into marginally complex integer, not primarily simple FP, loops.
> This should get us more immediate bang, before future complexities like predication and scatter/gather
> are fully integrated into the compiler and so able to be used in rather more complex loops.

The bits of NEON "missing" from SVE were specialized integer DSP operations, designed for SIMD on types narrower than 32 bits: saturating arithmetic, rounding halving add ((a+b+1)>>1), SAD, combined shift+round+narrow, and pairwise arithmetic.

Given the semantics of C, I don't expect any of that to be easily emitted by a compiler without intrinsics, which I suspect is why it didn't make the cut for SVE in the first place.
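
To make the C-semantics point concrete, here is a rough scalar sketch of a few of those operations as one might write them in portable C. The function names are made up for illustration. Narrow operands are promoted to int before any arithmetic, so the source only ever describes a wide add followed by a clamp or truncation, and the compiler would have to recognize the whole idiom to map it back onto a single narrow SIMD instruction. Whether a given compiler does so is not something this sketch claims.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounding halving add, (a + b + 1) >> 1. The operands are promoted to
   int, so the intermediate sum (at most 511) never wraps at 8 bits. */
static uint8_t rhadd_u8(uint8_t a, uint8_t b) {
    return (uint8_t)((a + b + 1) >> 1);
}

/* Saturating signed 16-bit add: a wide add plus an explicit clamp. */
static int16_t sat_add_s16(int16_t a, int16_t b) {
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Sum of absolute differences over n bytes, accumulated in 32 bits. */
static uint32_t sad_u8(const uint8_t *x, const uint8_t *y, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        int d = x[i] - y[i];            /* promoted, so d can be negative */
        acc += (uint32_t)(d < 0 ? -d : d);
    }
    return acc;
}

/* Shift + round + narrow from 16 to 8 bits, truncating (no saturation).
   Assumes 1 <= shift <= 8. */
static uint8_t rshrn_u16(uint16_t x, unsigned shift) {
    return (uint8_t)((x + (1u << (shift - 1))) >> shift);
}
```

Each body is several C-level operations on int, which is the gap between what the language expresses and what the narrow-type instructions do in one step.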

SVE2 did add WHILE instructions for reversed (down-counting) loop indexes, and also a non-destructive integer multiply, but I really don't think the rest of SVE2 is easily generated from C loops...
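
For reference, the loop shape those two additions are aimed at looks something like the sketch below: the index counts down, and the product goes to a third array so neither input is overwritten. The function name is hypothetical, and this only illustrates the source pattern; it makes no claim about what any particular compiler emits for it.

```c
#include <stddef.h>
#include <stdint.h>

/* Down-counting loop with a multiply whose destination differs from both
   sources. Unsigned elements keep the multiply free of overflow UB. */
static void mul_reversed(uint32_t *dst, const uint32_t *a,
                         const uint32_t *b, size_t n) {
    for (size_t i = n; i-- > 0; ) {
        dst[i] = a[i] * b[i];
    }
}
```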