By: Heikki Kultala (heikk.i.kultal.a.delete@this.gmail.com), June 2, 2022 10:40 am
Room: Moderated Discussions
Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on May 31, 2022 10:00 pm wrote:
> Heikki Kultala (heikki.kultal.a.delete@this.gmail.com) on May 31, 2022 8:59 am wrote:
> > What do you mean by claiming that RVV is designed for scaling down?
> > To me, it seems overly complex. Way too much state related to the vector configuration.
> First, RVV allows the implementation to choose the vector length and requires apps to cope with
> that. That saves a good deal of area compared to 512-bit. (This is also true of SVE/SVE2.)
To me, the way applications have to cope with that seems messier than with SVE.
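For reference, the "apps cope with the vector length" pattern is the RVV-style strip-mined loop. A minimal sketch in plain C (VLMAX here is a made-up stand-in for the hardware vector length; a real RVV binary never hard-codes it, since vsetvli reports it at run time):

```c
#include <stddef.h>

/* Stand-in for the hardware vector length in elements; on real RVV
   this comes from vsetvli, which is what makes one binary portable. */
#define VLMAX 8

/* Vector-length-agnostic add in the RVV strip-mining style: each pass
   asks "how many elements this time?" and advances by the answer. */
void vla_add(const int *a, const int *b, int *out, size_t n) {
    size_t i = 0;
    while (i < n) {
        size_t vl = (n - i < VLMAX) ? n - i : VLMAX; /* models vsetvli */
        for (size_t j = 0; j < vl; ++j)              /* models one vector op */
            out[i + j] = a[i + j] + b[i + j];
        i += vl;
    }
}
```

SVE code follows the same strip-mining shape, just with a predicate (whilelt) instead of an explicit vl, so the disagreement is about which mechanism is messier, not about the loop structure.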
> Second, there are some features that benefit simple single-issue architectures: in particular
> LMUL>1 and chaining so that multiple execution units can still be kept busy.
With LMUL > 1 there are fewer usable architectural registers available, because multiple registers are combined into one group.
So "compile once, run anywhere" does not really allow changing the LMUL parameter based on the target processor.
And LMUL makes things much more complicated for the implementation.
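The register-pressure point is simple arithmetic: RVV's 32 architectural vector registers collapse into 32/LMUL register groups, and a group must start at a register number aligned to LMUL. A sketch of the spec's grouping rule:

```c
/* RVV has 32 architectural vector registers (v0..v31). LMUL groups
   1, 2, 4 or 8 of them into one wider operand, so the number of
   independently nameable operands shrinks to 32/LMUL. */
int usable_vector_operands(int lmul) {
    return 32 / lmul;
}

/* A register group at a given LMUL must start at a register number
   that is a multiple of LMUL (e.g. v4 is a valid base at LMUL=4,
   v2 is not). */
int group_is_valid(int reg, int lmul) {
    return reg % lmul == 0;
}
```

So at LMUL=8 the compiler effectively has four operands to allocate, which is why LMUL cannot simply be dialed up or down per target without recompiling and re-running register allocation.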
> Third, the vector ISA is simpler (fewer instructions) than SVE and especially
> SVE2. I do wish RVV had more 2-arg permute/swizzle instructions, though.
With all those configuration parameters and state, RVV is very far from simple.
Complexity is not about number of instructions. It's more about data routing and state.
SVE/SVE2 has nice instructions with a clear purpose, very well suited for code that really is vector-length agnostic, without a lot of hassle. Easy for a compiler to vectorize.
RVV has all those configuration parameters because somebody on the specification committee was a fan of some 1980s vector machines.
But the real hindrance to an efficient RVV implementation is the specification of the widening/narrowing instructions. On a relatively wide SIMD implementation, they severely break the lane boundaries, needing an extra pipeline stage or two for data routing to reach reasonable clock rates, because a result may need to move from the middle to the end of a SIMD register.
On SVE, the narrowing instructions have padding in them so that they do not break lane boundaries badly (data may only move to an adjacent lane).
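The lane-crossing point can be made concrete with a toy model. All widths below are assumptions for illustration (a 512-bit register split into 128-bit lanes, narrowing 64-bit elements to 32-bit): RVV-style narrowing packs result element i at offset i*32, while an SVE2-style "bottom" narrowing (e.g. SQXTNB) leaves the result at the source element's offset, with the odd elements as padding.

```c
enum { LANE_BITS = 128, SRC_EW = 64, DST_EW = 32 };

/* Which 128-bit lane a given bit offset falls in. */
int lane_of(int bit_offset) { return bit_offset / LANE_BITS; }

/* Lane of 64-bit source element i. */
int src_lane(int i) { return lane_of(i * SRC_EW); }

/* RVV-style narrowing: dest element i is packed at offset i*DST_EW,
   so results from the upper lanes migrate toward the low lanes. */
int rvv_dst_lane(int i) { return lane_of(i * DST_EW); }

/* SVE2-style "bottom" narrowing: the result stays at the source
   element's offset, the top half of each 64-bit slot is padding. */
int sve_dst_lane(int i) { return lane_of(i * SRC_EW); }
```

With these numbers, source element 7 sits in lane 3; the RVV-style result lands in lane 1 (a two-lane crossing that needs wide routing wires), while the SVE2-style result stays in lane 3.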