By: Jukka Larja (roskakori2006.delete@this.gmail.com), September 29, 2021 6:50 am
Room: Moderated Discussions
-.- (blarg.delete@this.mailinator.com) on September 28, 2021 6:37 pm wrote:
> Jukka Larja (roskakori2006.delete@this.gmail.com) on September 28, 2021 6:37 am wrote:
> > I've never actually seen this done in any game, but I've seen plenty of libs offering such
> > easy to use option to get some extra performance from 4-wide SIMD. I've created such lib
> > myself and got decent performance improvement for one use case (something like twice the
> > performance) and a regression for another (probably due to 25 % cache waste).
> >
> > The thing is, going from AoS to SoA is often completely unrealistic. For myself, it's often about having
> > one or couple thingies that require some vector math to update. It's a chain of Vec3 operations, with
> > plenty of ifs sprinkled around, repeated couple of times. Not dozens or hundreds of times, as would
> > be needed to make SoA model make sense (also, this case is obviously no good for GPGPU).
>
> I'm surprised that you've never seen it done before, though I can certainly
> see cases where SoA (or AoSoA) doesn't fit well, whether it's existing code
> bases stuck on AoS, or it causes too many issues with non-SIMD stuff etc.
Well, I should have been a bit more specific. It is done in our engine, in one particular place. I don't know if anyone ever profiled the code though. These days it's in a form that makes it hard to tell how much time is actually spent there, and there's no non-SIMD implementation available.
I've seen remnants of similar code in other codebases, but they were leftovers from previous projects. I think many programmers try the SIMDVec4 style for fun, because it's easy; it seldom leads to significant performance increases. That's why I've seen the remnants, but not actual usage (or that's my theory, anyway).
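To make the "easy to try" part concrete, here is a minimal sketch of what such a drop-in SIMDVec4 might look like with SSE intrinsics. The names and layout are illustrative assumptions, not taken from any particular engine; note the unused fourth lane, which is where the cache waste mentioned above comes from.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>

// Hypothetical drop-in replacement for a scalar Vec3: three components
// packed into one SSE register, with the fourth lane wasted (hence the
// roughly 25 % extra cache footprint compared to a tightly packed Vec3).
struct SIMDVec4 {
    __m128 v;

    SIMDVec4(float x, float y, float z) : v(_mm_set_ps(0.0f, z, y, x)) {}
    explicit SIMDVec4(__m128 m) : v(m) {}

    // Component-wise ops map directly onto single instructions.
    SIMDVec4 operator+(const SIMDVec4& o) const {
        return SIMDVec4(_mm_add_ps(v, o.v));
    }
    SIMDVec4 operator*(float s) const {
        return SIMDVec4(_mm_mul_ps(v, _mm_set1_ps(s)));
    }

    // Extracting individual components needs shuffles.
    float x() const { return _mm_cvtss_f32(v); }
    float y() const { return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1))); }
    float z() const { return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2))); }
};
```

A typical use is the kind of position update chain described earlier (`p = p + vel * dt`), which is exactly the case where swapping the type in is cheap to try.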
> Even there, it may be worthwhile to do a transposition on-the-fly (unless the calculation is particularly
> light), i.e. do AoS→SoA transposition in registers, or in the case of ARM, there's LD3/LD4 instructions
> which automatically de-interleave for you with little to no overhead.
When the number of structs is in the ballpark of the width of the vector units, it's pretty hard to do any sensible conversion, especially when execution is likely to diverge after the first couple of operations.
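For reference, the in-register transposition mentioned in the parent can be sketched like this with SSE's transpose macro (on ARM, LD4 does the de-interleaving at load time instead). This is a generic illustration, not code from any engine, and it only pays off when there are enough structs to keep the lanes busy:

```cpp
#include <xmmintrin.h>  // SSE intrinsics, _MM_TRANSPOSE4_PS

// Load four xyzw structs (AoS) and transpose in registers to SoA form:
// xs = {x0,x1,x2,x3}, ys = {y0,y1,y2,y3}, and so on. With only a handful
// of structs, the transpose overhead can easily eat the SIMD gain.
void aos_to_soa4(const float* aos /* 4 structs * 4 floats */,
                 __m128& xs, __m128& ys, __m128& zs, __m128& ws) {
    xs = _mm_loadu_ps(aos + 0);   // {x0, y0, z0, w0}
    ys = _mm_loadu_ps(aos + 4);   // {x1, y1, z1, w1}
    zs = _mm_loadu_ps(aos + 8);   // {x2, y2, z2, w2}
    ws = _mm_loadu_ps(aos + 12);  // {x3, y3, z3, w3}
    _MM_TRANSPOSE4_PS(xs, ys, zs, ws);  // 4x4 in-register transpose
}
```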
> > If you can get it by replacing Vec3
> > with SIMDVec4 and tweaking a thing or two, it's fine. But if you need to profile the code and
> > 50 % of time revert the changes, because you got a performance regression, it's not.
>
> To the armchair expert that I am (who's never touched any 3D code), the whole
> vec3 concept sounds like something that's fraught with potential potholes.
> SIMD is generally not good with horizontal operations, so by placing x,y,z in the same vector,
> any calculation requiring interaction between the components is going to be more difficult.
>
> I get that there can be some easy gains to be had, compared to just doing everything scalar, so has its
> uses, but this form of fixed width usage also doesn't really take advantage of SVE's processing model.
You are very much correct, though as I said previously, gains often turn into regressions because of those potholes. As general advice, I would definitely tell people not to waste time on Vec3-to-SIMDVec4 kinds of optimization (and code like that may not be received well if presented in a job interview). But in some cases it is the best one can do with SIMD, and it is also very easy to try.
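The horizontal-operation pothole the parent mentions is easy to show with a dot product. In the xyz-in-one-register style it becomes a cross-lane reduction; in SoA the same operation is purely vertical and computes four dot products at once. A hedged sketch, again not from any particular codebase:

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// AoS style: one Vec3 per register (lane 3 unused). The dot product needs
// shuffles to move lanes around before adding, which is the horizontal
// traffic SIMD is bad at.
float dot3_aos(__m128 a, __m128 b) {
    __m128 m = _mm_mul_ps(a, b);  // {ax*bx, ay*by, az*bz, _}
    __m128 y = _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 z = _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 2, 2, 2));
    return _mm_cvtss_f32(_mm_add_ss(_mm_add_ss(m, y), z));
}

// SoA style: xs/ys/zs each hold one component of four different vectors.
// The result is four dot products with no cross-lane movement at all.
__m128 dot3_soa(__m128 ax, __m128 ay, __m128 az,
                __m128 bx, __m128 by, __m128 bz) {
    return _mm_add_ps(_mm_add_ps(_mm_mul_ps(ax, bx), _mm_mul_ps(ay, by)),
                      _mm_mul_ps(az, bz));
}
```

The SoA version is also the shape that maps naturally onto a vector-length-agnostic ISA like SVE, since nothing in it depends on the register being exactly four lanes wide.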
-JLarja