By: Wilco (Wilco.Dijkstra.delete@this.ntlworld.com), July 7, 2015 4:54 pm
Room: Moderated Discussions
Exophase (exophase.delete@this.gmail.com) on July 7, 2015 1:26 pm wrote:
> Maynard Handley (name99.delete@this.name99.org) on July 7, 2015 12:00 pm wrote:
> > Maybe the problem is when I say cmov/csel I am thinking of the (IMHO) obvious use cases.
> > max/min, abs, sgn, and the sorts of very similar functions I constantly dealt with when
> > writing codecs (eg parse one bit then, if (bit){motionVector=-motionVector})
> > All of these strike me as PRECISELY the point of cmov/csel.
> > Perhaps it's my experience in this field where one CONSTANTLY
> > has these sorts of one instruction branch-overs
> > --- for clamping values, for non-linear edge smoothing, etc --- that makes me appreciate their value;
> > and perhaps most people just don't encounter this sort of code in the code they write?
>
> A lot of those operations are already commonly supported directly in modern SIMD architectures. Or can be
> synthesized in a similar or smaller number of instructions compared to a solution with cmov or csel. For example,
> on ARM NEON if (bit){motionVector=-motionVector} can be computed as (on a vector of 32-bit ints):
>
> vtst.u32 mask, bit, bit
> veor.u32 motionVector, motionVector, mask
> vsub.u32 motionVector, motionVector, mask
>
> Where the equivalent with conditional select would be something like:
>
> vtst.u32 mask, bit, bit
> vneg.s32 motionVectorNeg, motionVector
> vbit.u32 motionVector, motionVectorNeg, mask
>
> Although they're the same number of ops, the former may be preferable over the latter since
> it uses less registers and since vbit can have lower throughput than veor and vsub on some uarchs
> (on the other hand, the latter can be preferable because it has a shorter critical path)
Usually the select/conditional-execution version is faster than both the branching and the branchless sequences. In your example the 2nd sequence would be better, as it is more parallel than the first: vtst and vneg are independent of each other, whereas veor and vsub each depend on the previous instruction. A better example is abs(), where the branch is hard to predict and branchless sequences need a long chain of dependent instructions (not just bad for latency, but also hogging decode, rename, dispatch and ALUs). GCC now uses CSNEG for abs() on AArch64, as you would expect, giving significant gains on various benchmarks.
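To make the comparison concrete, here is a rough scalar C sketch of the two styles discussed above. The function names are made up for illustration, and what a compiler actually emits depends on the target and flags; the comments only describe what one might expect on AArch64. The arithmetic is done in unsigned types to avoid signed-overflow undefined behaviour, and the final conversion back to int32_t assumes the usual two's complement behaviour.

```c
#include <stdint.h>

/* Branchless conditional negate: scalar analogue of the
   vtst + veor + vsub sequence. Each step depends on the previous one. */
int32_t negate_if_xor_sub(int32_t v, int32_t bit)
{
    uint32_t mask = 0u - (uint32_t)(bit != 0);   /* 0 or all-ones */
    return (int32_t)(((uint32_t)v ^ mask) - mask);
}

/* Select-style conditional negate: the negation and the test are
   independent, and a final select picks one of the two values. */
int32_t negate_if_select(int32_t v, int32_t bit)
{
    int32_t neg = (int32_t)(0u - (uint32_t)v);
    return bit ? neg : v;
}

/* Branchless abs(): build a sign mask, xor, subtract (a dependent chain). */
int32_t abs_xor_sub(int32_t v)
{
    uint32_t mask = 0u - ((uint32_t)v >> 31);    /* all-ones if v < 0 */
    return (int32_t)(((uint32_t)v ^ mask) - mask);
}

/* Select-style abs(): a compiler targeting AArch64 may turn this
   into a compare followed by a conditional negate (CSNEG). */
int32_t abs_select(int32_t v)
{
    return v < 0 ? (int32_t)(0u - (uint32_t)v) : v;
}
```

Both variants of each function return the same results; the difference is only in the shape of the dependency chain the hardware has to execute.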
Wilco