By: Jukka Larja (roskakori2006.delete@this.gmail.com), September 28, 2021 6:37 am
Room: Moderated Discussions
-.- (blarg.delete@this.mailinator.com) on September 27, 2021 9:06 pm wrote:
> Kevin G (kevin.delete@this.cubitdesigns.com) on September 27, 2021 9:46 am wrote:
> > Three- and six-element vectors are relatively common for 3D work. Early in the history of SIMD, when it made
> > sense to have a CPU code path for this, the code would simply round out to the next largest power-of-2 vector
> > size and live with the 25% inefficiency. (Three 32-bit floats would run on a 128-bit SIMD unit, etc.) Nowadays
> > such bulk work is done on GPUs, where vector elements are decomposed and that 25% inefficiency is recovered.
>
> I don't know the specifics of your example, but it sounds symptomatic of poor code design
> or memory layout. It sounds a lot like someone who took a scalar AoS design, then threw
> the x, y and z coordinates horizontally into a single vector, and patted themselves on the
> back for adopting SIMD. In reality, they probably should re-layout their data structures
> to use SoA, where a vector width being a multiple of 3/6 provides no intrinsic benefit.
I've never actually seen this done in any game, but I've seen plenty of libs offering such an easy-to-use option to get some extra performance from 4-wide SIMD. I've created such a lib myself and got a decent performance improvement in one use case (something like twice the performance) and a regression in another (probably due to the 25% cache waste).
The thing is, going from AoS to SoA is often completely unrealistic. For me, it's often about having one or a couple of things that require some vector math to update: a chain of Vec3 operations, with plenty of ifs sprinkled around, repeated a couple of times. Not dozens or hundreds of times, as would be needed for the SoA model to make sense (this case is also obviously no good for GPGPU).
Generally speaking, making some code run twice as fast is nice, but unless the code is already optimized, it actually isn't anything to write home about. If you can get it by replacing Vec3 with SIMDVec4 and tweaking a thing or two, it's fine. But if you need to profile the code and revert the changes 50% of the time because you got a performance regression, it's not.
So I can understand why such libs exist and why they might be used here and there, but the benefits are likely nothing you'd want to parade around. Spending any significant time on them doesn't seem like a good idea.
-JLarja