By: -.- (blarg.delete@this.mailinator.com), September 28, 2021 6:37 pm
Room: Moderated Discussions
Jukka Larja (roskakori2006.delete@this.gmail.com) on September 28, 2021 6:37 am wrote:
> I've never actually seen this done in any game, but I've seen plenty of libs offering such
> easy to use option to get some extra performance from 4-wide SIMD. I've created such lib
> myself and got decent performance improvement for one use case (something like twice the
> performance) and a regression for another (probably due to 25 % cache waste).
>
> The thing is, going from AoS to SoA is often completely unrealistic. For myself, it's often about having
> one or couple thingies that require some vector math to update. It's a chain of Vec3 operations, with
> plenty of ifs sprinkled around, repeated couple of times. Not dozens or hundreds of times, as would
> be needed to make SoA model make sense (also, this case is obviously no good for GPGPU).
I'm surprised that you've never seen it done before, though I can certainly see cases where SoA (or AoSoA) doesn't fit well, whether it's existing code bases stuck on AoS, or because it causes too many issues with non-SIMD code, etc.
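(To be concrete, by AoS/SoA/AoSoA I mean layouts roughly like the sketch below; the particle struct is just a made-up illustration, not anything from your code:)

#include <vector>

// AoS: one struct per element; x, y, z of one element sit next to each other in memory.
struct ParticleAoS { float x, y, z; };

// SoA: one array per component; all x's are contiguous, which is what wide SIMD loads want.
struct ParticlesSoA { std::vector<float> x, y, z; };

// AoSoA: an array of small fixed-width SoA blocks (4 here, matching 128-bit SIMD lanes).
struct ParticleBlock4 { float x[4], y[4], z[4]; };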
Even there, it may be worthwhile to do a transposition on the fly (unless the calculation is particularly light), i.e. do the AoS→SoA transposition in registers; or, in the case of ARM, there are LD3/LD4 instructions which automatically de-interleave for you with little to no overhead.
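Something like this, as a rough NEON sketch (the scale-by-s loop is just an example I made up, not anything from your code):

#include <arm_neon.h>
#include <stddef.h>

// Scale an AoS array of packed xyz triples by 's', de-interleaving on the fly.
// vld3q_f32 loads 12 floats and splits them across three registers:
// val[0] = {x0..x3}, val[1] = {y0..y3}, val[2] = {z0..z3}.
void scale_positions(float* xyz, size_t count, float s) {
    float32x4_t vs = vdupq_n_f32(s);
    for (size_t i = 0; i + 4 <= count; i += 4) {
        float32x4x3_t p = vld3q_f32(xyz + 3 * i);   // AoS -> SoA in registers
        p.val[0] = vmulq_f32(p.val[0], vs);
        p.val[1] = vmulq_f32(p.val[1], vs);
        p.val[2] = vmulq_f32(p.val[2], vs);
        vst3q_f32(xyz + 3 * i, p);                  // SoA -> AoS on store
    }
    // (Remainder handling for count % 4 omitted for brevity.)
}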
> If you can get it by replacing Vec3
> with SIMDVec4 and tweaking a thing or two, it's fine. But if you need to profile the code and
> 50 % of time revert the changes, because you got a performance regression, it's not.
As the armchair expert that I am (who's never touched any 3D code), the whole Vec3 concept sounds like something that's fraught with potential potholes.
SIMD is generally not good with horizontal operations, so by placing x,y,z in the same vector, any calculation requiring interaction between the components is going to be more difficult.
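E.g. compare a single Vec3 dot product, which needs a horizontal reduction across lanes, with the SoA version that does four dot products at once using only lane-wise ops; a quick NEON sketch of what I mean (AArch64, made up for illustration):

#include <arm_neon.h>

// "Vec3 in a SIMD register" style: one dot product needs a horizontal add.
float dot_vec3(float32x4_t a, float32x4_t b) {    // unused w lane assumed zero
    float32x4_t prod = vmulq_f32(a, b);
    return vaddvq_f32(prod);                       // horizontal add across lanes
}

// SoA style: four dot products at once, vertical (lane-wise) ops only.
float32x4_t dot_soa(float32x4_t ax, float32x4_t ay, float32x4_t az,
                    float32x4_t bx, float32x4_t by, float32x4_t bz) {
    float32x4_t d = vmulq_f32(ax, bx);
    d = vfmaq_f32(d, ay, by);                      // d += ay * by
    d = vfmaq_f32(d, az, bz);                      // d += az * bz
    return d;
}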
I get that there can be some easy gains to be had, compared to just doing everything scalar, so it has its uses, but this form of fixed-width usage also doesn't really take advantage of SVE's processing model.
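With SoA you can instead write vector-length-agnostic loops that scale to whatever vector width the hardware has; a rough SVE sketch (the integration loop is just a made-up example, not anything from the discussion):

#include <arm_sve.h>
#include <stddef.h>

// Vector-length-agnostic position update over SoA component arrays.
// The same binary uses 128-bit vectors on one core and 512-bit on another.
void integrate_x(float* x, const float* vx, float dt, size_t n) {
    for (size_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);          // predicate masks the tail
        svfloat32_t px = svld1_f32(pg, x + i);
        svfloat32_t pv = svld1_f32(pg, vx + i);
        px = svmla_n_f32_x(pg, px, pv, dt);         // x += vx * dt
        svst1_f32(pg, x + i, px);
    }
}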