By: Dan Fay (firstname.lastname@example.org), June 2, 2020 8:37 am
Room: Moderated Discussions
> What I don't understand is how FP part of the core work.
> 1. I have no mental figure of relationship between integer pipeline and FP pipeline.
> 2. I don't understand why FP Load instructions can't be pipelined with other FP
> Load/Store instructions in the manner similar to their Integer counterparts.
> 3. TRM claims that computational FP instructions, like VADD.F32 or VMUL.F32
> have latency of 2 clocks and throughput of 1 per clock, but my measurements
> (on STM32F303) clearly show that the throughput is twice lower than claimed.
> 4. According to TRM, multiply-accumulate instructions, both fused (VFMA.F32) and non-fused
> (VMLA.F32) are slower than properly scheduled separate add+mul. If it's true then why
> compilers, in particular gcc, generate them when optimizing for speed (-O2)?
> 5. If VLDM.32 is really so much faster than the sequence of VLDR.32 as claimed
> in TRM then why gcc -O2 does not generate VLDM.32 at every opportunity?
My guess is that a separate FP add + mul takes up more code space than a single FMA instruction (two 32-bit encodings instead of one). I wouldn't be surprised if, even at -O2, the gcc developers decided that the smaller code is worth the tradeoff on a microcontroller. What happens with -O3?
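For anyone who wants to check this, here is a minimal sketch of a test case, not anything from the original measurements. The function name dot_mac and the file name mac.c are made up for illustration; the command lines in the comment assume an arm-none-eabi-gcc toolchain targeting the Cortex-M4F. As far as I know, GCC in its default GNU mode uses -ffp-contract=fast, so it is allowed to contract the multiply and add into one instruction; -ffp-contract=off should force them apart, which makes it easy to compare code size and cycle counts.

```c
#include <stddef.h>

/*
 * Multiply-accumulate over two float arrays.
 *
 * Assumed build commands (hypothetical file name mac.c):
 *   arm-none-eabi-gcc -O2 -mcpu=cortex-m4 -mthumb -mfpu=fpv4-sp-d16 \
 *       -mfloat-abi=hard -S mac.c
 *   ... same, plus -ffp-contract=off, to forbid contraction into a
 *       single multiply-accumulate instruction.
 *
 * Diff the two .s files to see which of VFMA.F32 / VMLA.F32 /
 * VMUL.F32 + VADD.F32 the compiler picked at each setting.
 */
float dot_mac(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];   /* candidate for contraction */
    return acc;
}
```

Repeating the build with -O3 and -Os in place of -O2 would show whether the choice tracks the optimization level at all.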
I'm also curious what ARM's compiler does. I'm going to try to look at what it generates for an M4F (specifically, an STM32F412).