--- ( on August 8, 2022 2:54 pm wrote:
> I found a lot interesting in this update:
> On the plus side, it's clear that there still remains some low-hanging fruit for the ARM ecosystem
> when compared to the x86 ecosystem, even in fairly basic things like special functions and BLAS.
> On the negative side, the SVE results seem disappointing. One can spin this in a few different ways (isolated
> loops won't show the size improvements from simpler loops without head and tails, as opposed to real large
> apps using large shared libraries; these are already vector-dense loops, whereas SVE should help more code
> that's less trivially vectorizable), and clearly the SVE optimization has only just begun.
> Still, somewhat disappointing results. (Except of course that himeno number. Given how
> much Phoronix pushes himeno, I look forward to seeing Michael try to justify this!)

If anything, I think it shows the opposite - himeno has a large, easily vectorizable inner loop with contiguous accesses. But looking at the autovectorizer output, it's doing 1.4 loads per vector op. Without actually running it on a Neoverse-V1, I'd hazard a guess that the NEON version is saturating the 5-wide decode. So it's a perfect use case for wider ALUs (and more load bandwidth), rather than fancy instructions.
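To make the "load-heavy" shape concrete, here is a rough sketch in C of the kind of loop I mean. It is a hypothetical, much-simplified stand-in for a Jacobi-style stencil row, not the actual himeno source; the function name `stencil_row` and its parameters are made up for illustration, and the load/op mix here is not meant to reproduce the 1.4 figure above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stencil row update (NOT the real himeno kernel).
 * Each iteration reads five contiguous-stream elements
 * (p[i-1], p[i+1], up[i], down[i], coef[i]) and does only four
 * arithmetic operations (three adds, one multiply) before the store.
 * A vectorizer can widen every one of those accesses, so the loop is
 * limited by how many loads and ALU ops the core can issue per cycle
 * rather than by any need for exotic instructions. */
void stencil_row(float *restrict out,
                 const float *restrict p,
                 const float *restrict up,
                 const float *restrict down,
                 const float *restrict coef,
                 size_t n)
{
    assert(n >= 2);
    /* Interior points only: out[0] and out[n-1] are left untouched. */
    for (size_t i = 1; i + 1 < n; ++i) {
        float s = p[i - 1] + p[i + 1] + up[i] + down[i];
        out[i] = coef[i] * s;
    }
}
```

With twice the vector width, the same trip count needs roughly half as many load and arithmetic instructions, which is where I'd expect the gain to come from in a loop of this shape.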

The only fancy instruction the SVE version really benefits from is reg+reg addressing in 256-bit wide contiguous vector loads; NEON's LDP only supports immediate offsets.

> Perhaps hoping for better loops in generic code requires
> SVE2 and we can't hope for much of that with just SVE?

Which instructions are you thinking of? I don't see much in SVE2 that would be very useful to autovectorization / SPMD.
