By: noko (noko.delete@this.noko.com), August 8, 2022 9:30 pm
Room: Moderated Discussions
--- (---.delete@this.redheron.com) on August 8, 2022 2:54 pm wrote:
> I found a lot interesting in this update:
>
> https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/arm-compiler-for-linux-and-arm-performance-libraries-22-0
>
> On the plus side, it's clear that there still remains some low-hanging fruit for the ARM ecosystem
> when compared to the x86 ecosystem, even in fairly basic things like special functions and BLAS.
>
> On the negative side, the SVE results seem disappointing. One can spin this in a few different ways (isolated
> loops won't show the size improvements from simpler loops without head and tails, as opposed to real large
> apps using large shared libraries; these are already vector-dense loops, whereas SVE should help more code
> that's less trivially vectorizable), and clearly the SVE optimization has only just begun.
> Still, somewhat disappointing results. (Except of course that himeno number. Given how
> much Phoronix pushes himeno, I look forward to seeing Michael try to justify this!)
If anything I think it shows the opposite - himeno has a large, easily vectorizable inner loop with contiguous accesses. But looking at the autovectorizer output, it's doing 1.4 loads per vector op. Without actually running it on a Neoverse-V1, I'd hazard a guess that the NEON version is saturating the 5-wide decode. So it's a perfect use case for wider ALUs (and more load bandwidth), rather than fancy instructions.
The only fancy instruction the SVE version really benefits from is reg+reg addressing in 256-bit wide contiguous vector loads; NEON's LDP only supports immediate offsets.
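To make the "lots of contiguous loads per arithmetic op" point concrete, here is a rough sketch of that kind of loop. It is a deliberately simplified stand-in, not himeno's actual kernel: the function name `stencil_row`, the array names, and the 4-neighbour form are all made up for illustration. Each iteration reads six values from unit-stride arrays and does only a handful of floating-point operations on them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stencil sweep over one row. This is an
 * illustration only, NOT the himeno kernel. All accesses are unit-stride
 * in k, so the loads are contiguous and easy to vectorize. */
float stencil_row(size_t n,
                  const float *restrict coef,
                  const float *restrict up,
                  const float *restrict mid,
                  const float *restrict down,
                  float *restrict out,
                  float omega)
{
    assert(n >= 2);
    float err = 0.0f;
    for (size_t k = 1; k + 1 < n; ++k) {
        /* Six loads per element: coef[k], up[k], down[k],
         * mid[k-1], mid[k], mid[k+1]. */
        float s = coef[k] * (up[k] + down[k] + mid[k - 1] + mid[k + 1]);
        float d = s - mid[k];        /* residual at this point */
        err += d * d;                /* running sum of squared residuals */
        out[k] = mid[k] + omega * d; /* relaxed update, one store */
    }
    return err;
}
```

With 128-bit vectors, a loop like this needs twice as many load and arithmetic instructions per element as with 256-bit vectors, which is why a front-end limit would hit the NEON version first if my guess above is right.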
> Perhaps hoping for better loops in generic code requires
> SVE2 and we can't hope for much of that with just SVE?
Which instructions are you thinking of? I don't see much in SVE2 that would be very useful to autovectorization / SPMD.
Topic | Posted By | Date |
---|---|---|
Interesting ARM compiler data | --- | 2022/08/08 02:54 PM |
Interesting ARM compiler data | noko | 2022/08/08 09:30 PM |
V1 bottleneck | Jan Wassenberg | 2022/08/09 12:38 AM |
Interesting ARM compiler data | --- | 2022/08/09 10:15 AM |
Interesting ARM compiler data | noko | 2022/08/09 11:34 AM |
Interesting ARM compiler data | Jörn Engel | 2022/08/09 01:45 PM |
Interesting ARM compiler data | --- | 2022/08/09 01:49 PM |