Understanding Cortex M4F instructions timing

By: anon³ (arm.delete@this.micros.test), June 1, 2020 9:26 pm
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on June 1, 2020 11:07 am wrote:
> May be, I am 10-15 years out of date, but popularity of Cortex M4 is still
> very high and documentation is still, IMHO, lacking. So, let's go.

You're not out of date; the M4 has just never been all that popular compared to the M3. A shame, it's a nice core.

You're right that it's badly underdocumented though.

> What I don't understand is how FP part of the core work.
> 1. I have no mental figure of relationship between integer pipeline and FP pipeline.

I don't exactly know either, but I know they're relatively separate. I believe the FPU is technically a tightly-coupled coprocessor, or maybe I'm getting the M4 confused with something else.

The reference by Joseph Yiu (The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors) is nice, especially if you have to deal with interrupt handling (and you're going to have to deal with interrupt handling). It covers things that aren't in the official docs, or at least are buried deep enough that I had trouble teasing them out.

> 2. I don't understand why FP Load instructions can't be pipelined with other FP
> Load/Store instructions in the manner similar to their Integer counterparts.

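Probably that coprocessor interface again. If FP loads and stores have to go across it, that could be what keeps them from overlapping the way the integer ones do, but that's a guess.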

> 3. TRM claims that computational FP instructions, like VADD.F32 or VMUL.F32
> have latency of 2 clocks and throughput of 1 per clock, but my measurements
> (on STM32F303) clearly show that the throughput is twice lower than claimed.

That's interesting. Most M4 applications probably aren't number-crunching, so who knows what they traded off. What was your test loop?
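For comparison, here is roughly the kind of loop I'd expect for this sort of measurement. It is only a sketch under assumptions, not anyone's actual benchmark: the function names (cycle_counter_enable, read_cycles, chain_dependent, chain_independent, cycles_per_op) are made up for illustration, the cycle counter is the standard DWT CYCCNT register, and on a non-ARM host it falls back to clock() just so the code still builds and runs. The dependent chain should be bound by latency and the independent chains by throughput, so if the TRM numbers hold the second should come out cheaper per add.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Enable the DWT cycle counter on a Cortex-M4; does nothing on a host build. */
void cycle_counter_enable(void)
{
#if defined(__ARM_ARCH_7EM__)
    *(volatile uint32_t *)0xE000EDFCu |= (1u << 24); /* DEMCR.TRCENA */
    *(volatile uint32_t *)0xE0001004u = 0u;          /* DWT_CYCCNT = 0 */
    *(volatile uint32_t *)0xE0001000u |= 1u;         /* DWT_CTRL.CYCCNTENA */
#endif
}

/* Read the cycle counter; host fallback is clock(), which is NOT cycles. */
uint32_t read_cycles(void)
{
#if defined(__ARM_ARCH_7EM__)
    return *(volatile uint32_t *)0xE0001004u;        /* DWT_CYCCNT */
#else
    return (uint32_t)clock();
#endif
}

/* Latency-bound: every add needs the result of the previous one. */
float chain_dependent(float x, float step, int n)
{
    for (int i = 0; i < n; i++) {
        x += step; x += step; x += step; x += step;
    }
    return x;
}

/* Throughput-bound: four accumulators with no dependency between them. */
float chain_independent(float x, float step, int n)
{
    float a = x, b = x, c = x, d = x;
    for (int i = 0; i < n; i++) {
        a += step; b += step; c += step; d += step;
    }
    return a + b + c + d;
}

/* Elapsed counts per operation; unsigned subtraction survives one wraparound. */
float cycles_per_op(uint32_t start, uint32_t end, uint32_t ops)
{
    assert(ops > 0);
    return (float)(uint32_t)(end - start) / (float)ops;
}
```

On the target you'd call cycle_counter_enable() once, bracket each function with read_cycles(), and divide by 4*n adds. Build without -ffast-math, pass the inputs through volatile variables so nothing gets constant-folded, and check the disassembly to make sure the loop body really is the run of VADD.F32 you think it is. Loop overhead is included in the count, so unroll further if you want it amortized.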

> 4. According to TRM, multiply-accumulate instructions, both fused (VFMA.F32) and non-fused
> (VMLA.F32) are slower than properly scheduled separate add+mul. If it's true then why
> compilers, in particular gcc, generate them when optimizing for speed (-O2)?
> 5. If VLDM.32 is really so much faster than the sequence of VLDR.32 as claimed
> in TRM then why gcc -O2 does not generate VLDM.32 at every opportunity?

GCC generates crappy code? Say it isn't so!