Understanding Cortex M4F instructions timing

By: Michael S (already5chosen.delete@this.yahoo.com), June 1, 2020 11:07 am
Room: Moderated Discussions
Maybe I am 10-15 years out of date, but the popularity of Cortex M4 is still very high and the documentation is still, IMHO, lacking. So, let's go.

I think that I understand how the integer part of the core works.
1. All computational instructions (except division) take 1 clock.
2. Integer Loads take 2 clocks (latency),
2a. but they can be pipelined with other Loads or Stores for higher throughput, in the limit approaching 1 Load per clock.
2b. Load and Store instructions consume their input operands (offset and base) earlier than other instructions, so when their inputs are modified by the previous instruction, there is a 1-clock stall.
It's not clear whether this penalty applies only when the register was written by an arithmetic/move instruction or as a load target, or also when it is a base register updated by a Load/Store with pre/post-increment.
2c. It does not matter whether the destination register of a Load is used by the very next computational instruction or not; there is always a 1-clock pipeline bubble between a Load and a computational instruction. As I understand it, the reason is contention on the register file write port.
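
To make the rules above concrete, here is a toy cycle estimator. It is only a sketch of my mental model as listed in points 1 to 2c, not something confirmed by Arm documentation. The `Insn` record, the `estimate_cycles` name and the register names are made up for illustration, stores and division are left out, and base-register writeback is ignored because its behaviour is the open question in 2b.

```python
from dataclasses import dataclass


@dataclass
class Insn:
    """Hypothetical instruction record used only by this sketch."""
    kind: str          # "alu" (1-clock computational) or "load"
    dst: str = ""      # register written, empty if none
    addr: tuple = ()   # registers that form the address (loads only)


def estimate_cycles(program):
    """Estimate clocks for a straight-line sequence under rules 1, 2, 2a, 2b, 2c."""
    cycles = 0
    prev = None
    for insn in program:
        if insn.kind not in ("alu", "load"):
            raise ValueError("this sketch models only 'alu' and 'load'")
        # Rule 2c: a load followed by a computational instruction always
        # costs one bubble, whether or not the loaded value is used.
        if prev is not None and prev.kind == "load" and insn.kind == "alu":
            cycles += 1
        # Rule 2b: address operands are consumed early, so a load whose
        # base/offset was written by the previous instruction stalls 1 clock.
        if insn.kind == "load" and prev is not None and prev.dst and prev.dst in insn.addr:
            cycles += 1
        # Rules 1 and 2a: one issue slot per instruction; back-to-back
        # loads overlap, so each extra load adds only one clock.
        cycles += 1
        prev = insn
    # Rule 2: the last load of the sequence still needs its second clock.
    if prev is not None and prev.kind == "load":
        cycles += 1
    return cycles


if __name__ == "__main__":
    # Three independent loads, then one add: 3 loads + 1 bubble + 1 add.
    demo = [
        Insn("load", dst="r0", addr=("r4",)),
        Insn("load", dst="r1", addr=("r4",)),
        Insn("load", dst="r2", addr=("r4",)),
        Insn("alu", dst="r0"),
    ]
    print(estimate_cycles(demo))
```

Under this model, N back-to-back independent loads cost N+1 clocks, which is where the "approaching 1 Load per clock" limit in 2a comes from.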

What I don't understand is how the FP part of the core works.
1. I have no mental picture of the relationship between the integer pipeline and the FP pipeline.
2. I don't understand why FP Load instructions can't be pipelined with other FP Load/Store instructions in a manner similar to their integer counterparts.
3. The TRM claims that computational FP instructions, like VADD.F32 or VMUL.F32, have a latency of 2 clocks and a throughput of 1 per clock, but my measurements (on STM32F303) clearly show that the throughput is half of what is claimed.
4. According to the TRM, multiply-accumulate instructions, both fused (VFMA.F32) and non-fused (VMLA.F32), are slower than a properly scheduled separate add+mul. If that's true, then why do compilers, in particular gcc, generate them when optimizing for speed (-O2)?
5. If VLDM.32 is really so much faster than a sequence of VLDR.32 as claimed in the TRM, then why doesn't gcc -O2 generate VLDM.32 at every opportunity?
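
For points 3 to 5, the arithmetic behind the questions can be written down as a small sketch. Every cycle count here is an assumption, not a measurement: 1 clock per well-scheduled VADD.F32/VMUL.F32 is the TRM throughput quoted in point 3, while 3 clocks per multiply-accumulate, 2 clocks per non-pipelined VLDR.32 and 1+N clocks for an N-register VLDM.32 are my reading of the TRM table. The function names are made up for illustration.

```python
def separate_mul_add_cycles(pairs, cycles_per_op=1):
    """Clocks for `pairs` independent VMUL.F32 + VADD.F32 pairs, assuming they
    are scheduled so that no result is consumed by the very next instruction."""
    return 2 * pairs * cycles_per_op


def mac_cycles(pairs, cycles_per_mac=3):
    """Clocks for the same work done with VMLA.F32/VFMA.F32 (assumed 3 clocks each)."""
    return pairs * cycles_per_mac


def vldr_sequence_cycles(regs, cycles_per_vldr=2):
    """Clocks for `regs` separate VLDR.32 instructions that do not pipeline (point 2)."""
    return regs * cycles_per_vldr


def vldm_cycles(regs):
    """Clocks for one VLDM.32 of `regs` registers, assumed to be 1 + N."""
    return 1 + regs


if __name__ == "__main__":
    n = 8
    print("separate mul+add, TRM throughput:     ", separate_mul_add_cycles(n))
    print("separate mul+add, measured throughput:", separate_mul_add_cycles(n, cycles_per_op=2))
    print("multiply-accumulate:                  ", mac_cycles(n))
    print("VLDR.32 sequence:                     ", vldr_sequence_cycles(n))
    print("single VLDM.32:                       ", vldm_cycles(n))
```

With the TRM throughput, separate mul+add wins over multiply-accumulate (2 vs 3 clocks per pair). If the throughput I measured in point 3 (one instruction per 2 clocks) is plugged in instead, the order flips (4 vs 3), so the answer to point 4 may well depend on the answer to point 3.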
