By: -.- (blarg.delete@this.mailinator.com), September 27, 2021 9:06 pm
Room: Moderated Discussions
Kevin G (kevin.delete@this.cubitdesigns.com) on September 27, 2021 9:46 am wrote:
> Three and six element vectors are relatively common for 3D work. Early in the history of SIMD when it made
> sense to have a CPU code path for this, the code would simply round out to the next largest power of 2 vector
> size and live with the 25% inefficiency. (Three 32-bit floats would run on a 128-bit SIMD unit, etc.) Nowadays
> such bulk work is done on GPUs where vector elements are decomposed and that 25% inefficiency is recovered.
I don't know the specifics of your example, but it sounds symptomatic of poor code design or memory layout. It sounds a lot like someone took a scalar AoS (array-of-structures) design, packed the x, y and z coordinates horizontally into a single vector, and patted themselves on the back for adopting SIMD. In reality, they probably should have laid out their data as SoA (structure-of-arrays), where each vector holds the same coordinate from several points, so a vector width that is a multiple of 3 or 6 provides no intrinsic benefit.
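To make the layout point concrete, here is a minimal sketch in plain C. The type and function names (PointAoS, PointsSoA, squared_lengths_soa) are made up for illustration and are not taken from any particular codebase.

```c
#include <stddef.h>

/* AoS: one struct per point, so x, y and z are adjacent in memory.
   Shown only for contrast with the SoA layout below. */
typedef struct { float x, y, z; } PointAoS;

/* SoA: one array per component (hypothetical layout for illustration). */
typedef struct {
    const float *x;
    const float *y;
    const float *z;
    size_t count;
} PointsSoA;

/* Squared length of every point. Each iteration reads the same index
   from x, y and z, so consecutive iterations map onto SIMD lanes at any
   vector width, with no lane left idle for a missing fourth component. */
void squared_lengths_soa(const PointsSoA *p, float *out)
{
    for (size_t i = 0; i < p->count; ++i) {
        out[i] = p->x[i] * p->x[i]
               + p->y[i] * p->y[i]
               + p->z[i] * p->z[i];
    }
}
```

With this layout the loop works on N points per vector operation, whatever N the hardware offers, instead of one padded xyz_ point per vector.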
If the data needs to be stored interleaved, NEON and SVE conveniently provide LD3-family instructions that deinterleave it on load.
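For anyone unfamiliar with what a 3-way deinterleaving load does, here is a scalar model in portable C. The function name deinterleave3 is hypothetical. On AArch64, the NEON intrinsic vld3q_f32 performs the same split for four points at a time, returning three 4-lane vectors.

```c
#include <stddef.h>

/* Split interleaved x0 y0 z0 x1 y1 z1 ... data into separate x, y and z
   arrays. This is a scalar model of a 3-way deinterleaving load, not the
   instruction itself. */
void deinterleave3(const float *xyz, size_t count,
                   float *x, float *y, float *z)
{
    for (size_t i = 0; i < count; ++i) {
        x[i] = xyz[3 * i + 0];
        y[i] = xyz[3 * i + 1];
        z[i] = xyz[3 * i + 2];
    }
}
```

After the split, each output array holds one component for all points, so the rest of the code can proceed as if the data had been stored as SoA.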