Understanding Cortex M4F instructions timing

By: Michael S (already5chosen.delete@this.yahoo.com), June 2, 2020 8:23 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on June 1, 2020 11:07 am wrote:
> May be, I am 10-15 years out of date, but popularity of Cortex M4 is still
> very high and documentation is still, IMHO, lacking. So, let's go.
> I think, that I understand how integer part of the core works.
> 1. All computational instructions (except division) take 1 clock.
> 2. Integer Loads take 2 clocks (latency)
> 2a. but can be pipelined with other Loads or stores, for higher
> throughput, in the limit approaching 1 Load per clock.
> 2b. Load&store instructions consume their input operands (offset and base) earlier than other instructions,
> so when their inputs are modified by previous instruction, there is 1-clock stall.
> It's not clear if this penalty applies only to arithmetic/move/load-target
> or also to base register, updated by Load/Store with pre/post-increment.
> 2c. It does not matter if destination register of load is used by very next computational
> instruction or not, there is always 1-clock pipeline bubble between load and computational.
> According to my understanding the reason is a contention on register file write port.
> What I don't understand is how FP part of the core work.
> 1. I have no mental figure of relationship between integer pipeline and FP pipeline.
> 2. I don't understand why FP Load instructions can't be pipelined with other FP
> Load/Store instructions in the manner similar to their Integer counterparts.
> 3. TRM claims that computational FP instructions, like VADD.F32 or VMUL.F32
> have latency of 2 clocks and throughput of 1 per clock, but my measurements
> (on STM32F303) clearly show that the throughput is twice lower than claimed.
> 4. According to TRM, multiply-accumulate instructions, both fused (VFMA.F32) and non-fused
> (VMLA.F32) are slower than properly scheduled separate add+mul. If it's true then why
> compilers, in particular gcc, generate them when optimizing for speed (-O2)?
> 5. If VLDM.32 is really so much faster than the sequence of VLDR.32 as claimed
> in TRM then why gcc -O2 does not generate VLDM.32 at every opportunity?

It turned out that my program storage (flash) was too slow: 2 wait states => at most 8 bytes per 3 clocks. So the majority of my measurements were bound by IFetch bandwidth.
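To put a number on that bound, here is a rough sketch of the arithmetic. It assumes a very simple model that is mine, not the TRM's: every flash access delivers a fixed number of bytes and costs (1 + wait states) clocks, with no prefetch buffer or cache helping. The function name and parameters are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Lower bound on the clocks needed just to fetch `code_bytes` of code
 * from a memory that delivers `line_bytes` per access, where each access
 * costs (1 + wait_states) clocks.
 * Assumed model: no prefetch, no cache, no overlap between accesses. */
uint32_t fetch_bound_clocks(uint32_t code_bytes,
                            uint32_t line_bytes,
                            uint32_t wait_states)
{
    assert(line_bytes > 0);
    /* Number of accesses, rounded up to whole lines. */
    uint32_t accesses = (code_bytes + line_bytes - 1) / line_bytes;
    return accesses * (1 + wait_states);
}
```

With 8-byte accesses and 2 wait states, 16 instructions of 32 bits each (64 bytes) can't be fetched in less than 24 clocks under this model, i.e. 1.5 clocks per instruction no matter what the execution units can do. With 0 wait states the same code needs only 8 clocks of fetch.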
I moved the test routines into 0ws SRAM and the results changed significantly. But I still didn't observe instruction timings equal to those claimed in the TRM.
It turned out that the core is very sensitive to code alignment. It does not like it when 32-bit code words are not aligned on 32-bit boundaries. This appears to be especially important for FP code, where the vast majority of instructions are 32-bit.
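A quick way to see how badly a given routine is affected is to walk its instruction sizes and count the 32-bit encodings that start on an address that is not a multiple of 4. The helper below is only a sketch: the name is hypothetical, and the list of sizes has to come from somewhere else (a disassembly listing, for example).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the 32-bit encodings in a Thumb-2 instruction sequence that
 * straddle a 32-bit boundary.
 * `start` is the address of the first instruction (halfword aligned),
 * `sizes[i]` is the size in bytes (2 or 4) of instruction i. */
size_t count_misaligned_wide(uint32_t start, const uint8_t *sizes, size_t n)
{
    assert((start & 1u) == 0);
    size_t count = 0;
    uint32_t addr = start;
    for (size_t i = 0; i < n; i++) {
        assert(sizes[i] == 2 || sizes[i] == 4);
        if (sizes[i] == 4 && (addr & 3u) != 0)
            count++;            /* wide instruction starts mid-word */
        addr += sizes[i];
    }
    return count;
}
```

Note how a single 16-bit instruction in front of a run of 32-bit FP instructions misaligns every one of them until the next 16-bit instruction. In hand-written GNU assembler the usual cure would be a `.balign 4` before the loop or a padding 16-bit NOP; whether that recovers all of the lost clocks is something to measure, not assume.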

It seems that 'gcc -O2' is not aware of this wonderful feature of the M4F core. I am not ready to blame gcc in this particular case, because the TRM says nothing about it.

Another interesting finding: when everything is aligned and scheduled, the code runs somewhat faster than the TRM figures would predict. It seems that VLDR.F32 *does* behave similarly to integer LDR, in the sense that a sequence of N VLDR.F32 instructions sometimes takes less than 2*N clocks.
