Understanding Cortex M4F instructions timing

By: Michael S (already5chosen.delete@this.yahoo.com), June 2, 2020 2:17 pm
Room: Moderated Discussions
Wilco (wilco.dijkstra.delete@this.ntlworld.com) on June 2, 2020 3:02 pm wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on June 2, 2020 11:56 am wrote:
> > I was in a hurry while writing the previous post.
> > The example above is not a good one, because in that case VMLA.F32 is a *good* choice.
> >
> > As I said in the original post, according to the TRM, VMLA.F32 is slower than properly scheduled separate add+mul.
>
> Strictly speaking it is not slower - the latency and throughput are identical.
> However you get one extra execute slot per fma if you split into mul+add.
>
> > But the example above is too short and does not provide an opportunity for proper scheduling.
> >
> > This example is better:

> > void foo(float* restrict res, const float x[4], float y, float z)
> > {
> > res[0] = x[0]*y + z;
> > res[1] = x[1]*y + z;
> > res[2] = x[2]*y + z;
> > res[3] = x[3]*y + z;
> > }
>
> > gcc on godbolt: https://godbolt.org/z/e6EHce
>
> How about this? This should probably be the default for Cortex-M4, just like LLVM.

-std=c99 would have the same effect.
But it works only because the gcc devs haven't heard about VMLA.F32 yet.
Tomorrow some smart guy on the gcc team discovers the existence of VMLA and our trick is gone.

> Note the
> generated code is the same for all CPUs, even say -mcortex-a53. This is a good example why AArch64
> added 4-operand FMA - expanding mov+fma back into mul+add would be better in this case.
>

My example is just an example. It's relatively atypical.
In the most typical use cases (variations of dot product) 3-op FMA (or non-fused MA) tends to be good enough.
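
To make the distinction concrete, here is a minimal sketch in C (the function names dot and scale_add4 are mine, for illustration only, and nothing here is measured). In the dot product the accumulator is both an addend and the destination, so a 3-operand multiply-accumulate maps onto it directly. In the foo-style pattern the addend z has to stay live for the next element, so a destructive 3-operand form would need z copied first, which is where separate mul+add can come out ahead.

```c
#include <stddef.h>

/* Dot product: acc is both an input and the destination of every step,
   so a destructive 3-operand multiply-accumulate (acc += a*b) fits
   without any extra register move. */
float dot(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];
    return acc;
}

/* foo-style pattern: z must survive every iteration, so a 3-operand
   multiply-accumulate would have to copy z into the destination first.
   Separate multiply and add need no such copy. */
void scale_add4(float *restrict res, const float x[4], float y, float z)
{
    for (int i = 0; i < 4; ++i)
        res[i] = x[i] * y + z;
}
```

Which instructions a compiler actually picks for either function depends on the compiler, its version and the flags, so check the generated assembly rather than assuming.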

> The goal of Cortex-M4 is to beat software floating point emulation - it achieves that.
>
> Wilco

Yes, Arm inc. can't be wrong, even when their docs are incomplete. I know.

