M4F - few convolution benches

By: Michael S (already5chosen.delete@this.yahoo.com), June 11, 2020 9:35 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on June 5, 2020 8:55 am wrote:
>
>
> For my real code, which does not resemble these tiny examples, I am very disappointed
> with both gcc and clang. They are stupid both in common ways and in different ways.
> Common: they never use VLDM
> Different:
> gcc:
> gcc doesn't align 32-bit instructions on 32-bit boundaries, even when it's very easy to do.
> gcc uses vfma, unless prevented from doing so by -std=c99 or by -ffp-contract=on
>
> clang:
> clang doesn't schedule dependent vmul.F32 and vadd.F32
> one instruction apart. Even when it's very easy to do.
>

Today I finally found a bit of time to test a real-world convolution kernel that is heavily optimized at the source level.
The kernel does baseband processing of a signal that arrives from an analog-to-digital converter as signed 16-bit samples. The input is converted to single-precision floating point and processed through a pair of FIR decimation filters. Each filter consists of 50 taps. Decimation factor = 10.
Each iteration of the filter consumes 40 input samples and generates 8 output samples (4 complex numbers).
The number of "algorithmic" operations per iteration, not counting call, setup and loop overhead:
180 loads (40 new inputs, 40 delay line reads, 100 filter taps),
48 stores (40 delay line writes + 8 results),
400 FPMULs, 400 FPADDs.
Total number of "algorithmic" operations = 1028.
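
For readers who want to see where these counts come from, here is a plain reference sketch of one iteration in C. It is not the optimized kernel that was benchmarked. The function name, the argument layout, the tap ordering (taps applied oldest sample first) and the reading of the two filters as producing the real and imaginary parts of each output are all assumptions made for illustration.

```c
#include <stdint.h>

enum {
    NTAPS = 50,             /* taps per filter (from the post) */
    DECIM = 10,             /* decimation factor (from the post) */
    BLOCK = 40,             /* new input samples per iteration (from the post) */
    NOUT  = BLOCK / DECIM,  /* 4 complex outputs per iteration */
    HIST  = NTAPS - DECIM   /* 40 older samples kept in the delay line */
};

/* Hypothetical reference routine: one iteration of a pair of 50-tap
   decimate-by-10 FIR filters sharing the same real input. */
void fir_decim_block(const int16_t in[BLOCK], float delay[HIST],
                     const float taps_re[NTAPS], const float taps_im[NTAPS],
                     float out[2 * NOUT])
{
    float win[HIST + BLOCK];    /* 80-sample window: old samples, then new */
    int i, j, k;

    for (i = 0; i < HIST; i++)  /* 40 delay line reads */
        win[i] = delay[i];
    for (i = 0; i < BLOCK; i++) /* 40 new inputs, int16 -> float */
        win[HIST + i] = (float)in[i];

    for (k = 0; k < NOUT; k++) {
        const float *x = &win[k * DECIM];   /* 50-sample span for output k */
        float re = 0.0f, im = 0.0f;
        for (j = 0; j < NTAPS; j++) {       /* 2 * 50 multiplies and adds */
            re += taps_re[j] * x[j];
            im += taps_im[j] * x[j];
        }
        out[2 * k]     = re;                /* 8 result stores in total */
        out[2 * k + 1] = im;
    }

    for (i = 0; i < HIST; i++)  /* 40 delay line writes */
        delay[i] = win[BLOCK + i];
}
```

Written this naively the inner loop reloads samples and taps for every output, so it performs far more than 180 loads. Getting down to the count above depends on keeping values in registers across outputs, which is where the source-level optimization and the compiler's register allocation come in.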

The results are as follows:

Compiler  Code-Size  F-clocks  S-clocks  F-FLOPs/Hz  S-FLOPs/Hz
clang           560      1982      1674       0.404       0.478
gcc1            384      1787      1694       0.448       0.472
gcc2            548      2015      1714       0.397       0.467
gcc1a           384      1763      1603       0.450       0.499
gcc2a           548      1983      1249       0.403       0.641


clang = clang 9.0.0 with the default set of speed-optimization options
gcc1 = gcc 9.2.1 ARM/arm-9-branch revision 277599 with the default set of speed-optimization options
gcc2 = the same as gcc1 plus -std=c99, which prevents gcc from generating vfma.F32
gcc1a = the same as gcc1 plus -falign-loops=4
gcc2a = the same as gcc2 plus -falign-loops=4
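
Some background on the gcc2 entry: per the GCC documentation, a strict ISO mode such as -std=c99 turns floating-point contraction off by default, which is the likely reason vfma.F32 disappears. Fusing is not only a speed matter, it also changes results slightly, because the product is no longer rounded before the add. The sketch below illustrates that effect. Both function names are hypothetical, and the double-precision version is only a stand-in for a fused operation, not vfma.F32 itself.

```c
/* Separate multiply and add: the product is rounded to float first.
   The volatile store keeps the compiler from contracting the two
   operations into a fused multiply-add. */
float mul_add_separate(float a, float b, float c)
{
    volatile float p = a * b;
    return p + c;
}

/* Stand-in for a fused multiply-add: the product of two floats is exact
   in double, so it is not rounded to float before the add. */
float mul_add_fused_ref(float a, float b, float c)
{
    return (float)((double)a * (double)b + (double)c);
}
```

With a = b = 1 + 2^-12 and c = -(1 + 2^-11), the separate version returns 0 because the 2^-24 term of the product is lost when it is rounded to float, while the stand-in returns 2^-24.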

The F-clocks column corresponds to code running from on-chip flash (2 wait states)
The S-clocks column corresponds to code running from on-chip fast SRAM (0 wait states)
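
The FLOPs/Hz columns look like the 800 floating-point operations per iteration (400 FPMULs + 400 FPADDs) divided by the measured clock count, for example 800 / 1982 ≈ 0.404. That is an inference from the numbers rather than something stated explicitly. A small sketch of the arithmetic, with hypothetical names:

```c
/* Per-iteration operation counts taken from the post. */
enum {
    LOADS     = 40 + 40 + 100,  /* new inputs + delay line reads + filter taps */
    STORES    = 40 + 8,         /* delay line writes + results */
    FPMULS    = 400,
    FPADDS    = 400,
    TOTAL_OPS = LOADS + STORES + FPMULS + FPADDS,  /* 1028 */
    FLOPS     = FPMULS + FPADDS                    /* 800 */
};

/* Floating-point operations per clock cycle for one measured run
   (assumed definition of the FLOPs/Hz columns). */
double flops_per_clock(unsigned clocks)
{
    return (double)FLOPS / (double)clocks;
}
```

The same clock counts can be divided into TOTAL_OPS instead, which gives "algorithmic" operations per clock rather than FLOPs per clock.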

My future plans:
Step 1 would be experimenting in assembler to examine the effect of concentrating as many vldr.F32 instructions as possible in contiguous batches.
Step 2 would be replacing vldr with vldm.
But none of it is going to happen until the middle of the next week.
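
The batching idea in Step 1 can be roughly expressed at source level, as in the sketch below. The function is hypothetical and assumes n is a multiple of 4. A C compiler is free to reorder or interleave these loads with the arithmetic, which is presumably why the experiment is planned in assembler. Loads from consecutive addresses grouped like this are also the natural candidates for a single VLDM, which fills several consecutive FP registers from consecutive memory.

```c
/* Hypothetical helper: dot product with the loads of each group of four
   samples and four taps written as one batch, ahead of the arithmetic.
   n is assumed to be a multiple of 4. */
float dot_batched(const float *x, const float *h, int n)
{
    float acc = 0.0f;
    int i;

    for (i = 0; i < n; i += 4) {
        /* Batch of loads from consecutive addresses. */
        const float x0 = x[i], x1 = x[i + 1], x2 = x[i + 2], x3 = x[i + 3];
        const float h0 = h[i], h1 = h[i + 1], h2 = h[i + 2], h3 = h[i + 3];

        /* Arithmetic only after the whole batch has been read. */
        acc += x0 * h0;
        acc += x1 * h1;
        acc += x2 * h2;
        acc += x3 * h3;
    }
    return acc;
}
```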
