By: Jan Wassenberg (jan.wassenberg.delete@this.gmail.com), June 2, 2022 11:09 pm
Room: Moderated Discussions
Heikki Kultala (heikk.i.kultal.a.delete@this.gmail.com) on June 2, 2022 10:40 am wrote:
> To me, the way applications can cope with that seems messier than with SVE.
Hm, is that necessarily the case? In our Highway backend for RVV, we basically always set avl to the maximum and rely on masks to ensure observable behavior is correct for shorter counts (just like in SVE).
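For illustration, here is a minimal sketch of that approach in plain RVV intrinsics (not Highway's actual backend code; the function and names are made up for this example): avl stays at VLMAX, and a mask of "lane index < remaining" makes the final partial block behave correctly, much like a governing predicate in SVE.

    #include <riscv_vector.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Sketch only: add two float arrays with avl fixed at VLMAX.
       The mask keeps the tail from reading or writing past count. */
    void add_arrays_rvv(const float* a, const float* b, float* out, size_t count) {
      const size_t vlmax = __riscv_vsetvlmax_e32m1();
      for (size_t i = 0; i < count; i += vlmax) {
        const size_t remaining = count - i;
        const uint32_t active = (uint32_t)(remaining < vlmax ? remaining : vlmax);
        const vuint32m1_t lane = __riscv_vid_v_u32m1(vlmax);  /* 0, 1, 2, ... */
        const vbool32_t m = __riscv_vmsltu_vx_u32m1_b32(lane, active, vlmax);
        const vfloat32m1_t va = __riscv_vle32_v_f32m1_m(m, a + i, vlmax);
        const vfloat32m1_t vb = __riscv_vle32_v_f32m1_m(m, b + i, vlmax);
        __riscv_vse32_v_f32m1_m(m, out + i, __riscv_vfadd_vv_f32m1(va, vb, vlmax), vlmax);
      }
    }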
> > With LMUL > 1 there are fewer usable architectural registers available because multiple registers are combined.
> So "Compile once, run anywhere" does not really allow changing
> that LMUL parameter based on the target processor.
Agreed, but we know there are 32 architectural regs, so we can choose LMUL based on the number of live variables in the kernel, right? It doesn't seem to me like adjusting LMUL based on the hardware would help.
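To make that concrete, a tiny hypothetical snippet at LMUL=4: each operand then occupies a group of four vector registers (v0, v4, v8, ...), so the allocator effectively has 32/4 = 8 register groups, which only pays off when the kernel has few live values.

    #include <riscv_vector.h>
    #include <stddef.h>

    /* Hypothetical illustration of LMUL=4: vfloat32m4_t spans four
       vector registers, leaving 8 allocatable register groups. */
    void scale_rvv_m4(float* data, float s, size_t count) {
      for (size_t i = 0; i < count;) {
        const size_t vl = __riscv_vsetvl_e32m4(count - i);  /* up to 4*VLEN/32 elements */
        const vfloat32m4_t v = __riscv_vle32_v_f32m4(data + i, vl);
        __riscv_vse32_v_f32m4(data + i, __riscv_vfmul_vf_f32m4(v, s, vl), vl);
        i += vl;
      }
    }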
> And LMUL makes things much more complicated for the implementation.
Sounds plausible, but I don't have much insight into that.
> SVE/SVE2 has nice instructions that serve a good purpose and are very well suited for code that
> really is vector-length agnostic without lots of hassle. Easy for a compiler to vectorize.
Agree about the nice instructions; the SVE backend of Highway was by far the easiest to write. Not sure about autovectorization though, time will tell ;)
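As an illustration of that vector-length-agnostic style (a hand-written sketch with ACLE intrinsics, not code taken from Highway), the same binary runs unchanged on any SVE vector length, with the tail handled by the predicate:

    #include <arm_sve.h>
    #include <stdint.h>

    /* Sketch only: svwhilelt yields an all-true predicate for full
       blocks and a partial one for the tail, so no separate remainder loop. */
    void add_arrays_sve(const float* a, const float* b, float* out, int64_t count) {
      for (int64_t i = 0; i < count; i += svcntw()) {
        const svbool_t pg = svwhilelt_b32_s64(i, count);
        const svfloat32_t va = svld1_f32(pg, a + i);
        const svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, out + i, svadd_f32_x(pg, va, vb));
      }
    }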
> But the real hindrance to an efficient RVV implementation is the specification of the widening/narrowing
> instructions. On a relatively wide SIMD implementation, they severely break the lane boundaries,
> needing an extra pipeline stage or a few for data routing to reach reasonable clock rates,
> because a result may need to go from the middle to the end of a SIMD register.
Ah, true. I wish there was also something like x86 PSHUFB that stays within 16-byte lanes.
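For context, x86's PSHUFB (and its AVX2 form VPSHUFB) only ever selects bytes from within the same 16-byte lane, so the byte-routing wiring stays local even on wide datapaths. A small illustrative example (the function name is made up):

    #include <immintrin.h>

    /* Each 16-byte lane of v is byte-reversed independently: VPSHUFB
       indices can only pick bytes from their own 128-bit lane. */
    __m256i reverse_bytes_in_each_lane(__m256i v) {
      const __m256i idx = _mm256_setr_epi8(
          15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0,   /* low lane  */
          15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);  /* high lane */
      return _mm256_shuffle_epi8(v, idx);
    }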