By: dmcq (dmcq.delete@this.fano.co.uk), September 28, 2021 2:29 pm
Room: Moderated Discussions
Andrey (andrey.semashev.delete@this.gmail.com) on September 28, 2021 12:12 pm wrote:
> Jukka Larja (roskakori2006.delete@this.gmail.com) on September 28, 2021 6:37 am wrote:
> > -.- (blarg.delete@this.mailinator.com) on September 27, 2021 9:06 pm wrote:
> > > Kevin G (kevin.delete@this.cubitdesigns.com) on September 27, 2021 9:46 am wrote:
> > > > Three and six element vectors are relatively common for 3D work. Early in the history of SIMD when it made
> > > > sense to have a CPU code path for this, the code would simply
> > > > round out to the next largest power of 2 vector
> > > > size and live with the 25% inefficiency. (Three 32-bit floats would run on a 128-bit SIMD unit, etc.) Nowadays
> > > > such bulk work is done on GPUs where vector elements are decomposed and that 25% inefficiency is recovered.
> > >
> > > I don't know the specifics of your example, but it sounds symptomatic of poor code design
> > > or memory layout. It sounds a lot like someone who took a scalar AoS design, then threw
> > > the x, y and z coordinates horizontally into a single vector, and patted themselves on the
> > > back for adopting SIMD. In reality, they probably should re-layout their data structures
> > > to use SoA, where a vector width being a multiple of 3/6 provides no intrinsic benefit.
> >
> > I've never actually seen this done in any game, but I've seen plenty of libs offering such an
> > easy-to-use option to get some extra performance from 4-wide SIMD. I've created such a lib
> > myself and got a decent performance improvement for one use case (something like twice the
> > performance) and a regression for another (probably due to the 25% cache waste).
> >
> > The thing is, going from AoS to SoA is often completely unrealistic. For myself, it's often about having
> > one or a couple of thingies that require some vector math to update. It's a chain of Vec3 operations, with
> > plenty of ifs sprinkled around, repeated a couple of times. Not dozens or hundreds of times, as would
> > be needed to make the SoA model make sense (also, this case is obviously no good for GPGPU).
>
> As I understand it, SIMD use cases where the amount of data is relatively small are not the target usage
> for SVE, which is geared towards processing large amounts of data in a loop. And with large amounts of
> data, you do want to convert to SoA, even if on the fly during the loop, because data density is key here.
I think it is more for intermediate sizes, and the long lengths will be tackled by the streaming form that they want to introduce with the Scalable Matrix Extension (SME).