By: K.K (anon.delete@this.anon.coom), December 5, 2020 3:31 am

Room: Moderated Discussions

I tried to benchmark the peak compute performance of the M1 GPU this morning (I have a MacBook Pro). Apple claims that the GPU is capable of 2.6 TFLOPS, yet clpeak only measures 1.1 TFLOPS, so I decided to do my own tests. Please keep in mind that I have no experience with this kind of benchmark and that my approach might be extremely naive. I hope that you smart people can help me make more sense of the results, because I find them quite strange (apologies for the long post and the lack of spacing; the forum seems to gobble up indentation).

Basic methodology was to run a long chain of FMA operations on the GPU. I chose the simple geometric Taylor series expansion

1/(1-x) = sum x^k (k>=0)

because it produces meaningful results and can be trivially computed as a chain of FMA operations like this

sum_0 = x

sum_k = fma(x, x, sum_{k-1})
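For reference, here is a minimal CPU-side sketch of a Horner-style chain that yields the partial sums of the series above with one dependent FMA per term. It is not the shader used for the measurements: the helper name `geometric_partial_sum` is hypothetical, and the step `fma(x, sum, 1)` differs from the recurrence as posted, which adds x*x at every step. Both are fully dependent chains with one FMA per step, so they should stress the ALUs in the same way.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not from the original shader): evaluates
// 1 + x + x^2 + ... + x^n using one dependent FMA per term.
template <typename T>
T geometric_partial_sum(T x, int n) {
    assert(n >= 0);
    T sum = T(1);                      // sum_0 = 1
    for (int k = 0; k < n; ++k) {
        sum = std::fma(x, sum, T(1));  // sum_k = x * sum_{k-1} + 1
    }
    return sum;
}
```

For |x| < 1 the result approaches 1/(1-x) as n grows, for example a value close to 2 for x = 0.5.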

The metal shader was based around the following C++ template:

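template <typename T, int N>
struct taylor_series_sum {
    static T compute(T x) {
        // one FMA on top of the chain of length N-1
        return fma(x, x, taylor_series_sum<T, N - 1>::compute(x));
    }
};

// base case: ends the recursion
template <typename T>
struct taylor_series_sum<T, 0> {
    static T compute(T x) {
        return x;
    }
};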

The compute kernel function itself is then fairly trivial. Memory reads and writes are avoided as much as possible; there is only a final store to prevent the kernel from being optimised out. Below is an example of a kernel that runs 512 FMA operations (for 1024 total FLOP per invocation):

kernel void benchmark(...) {
    out[0] = taylor_series_sum<float, 512>::compute((float) index / (float) nthreads);
}
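The TFLOPS figures below follow from the usual convention of counting each FMA as two floating-point operations. A minimal sketch of that conversion, assuming the host measures the kernel's elapsed time (the helper name `achieved_tflops` and its parameters are hypothetical, not part of the original benchmark):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: converts a timed kernel run into TFLOPS.
// Assumes every thread executes fma_per_thread FMAs and that one
// FMA counts as 2 FLOP (one multiply plus one add).
double achieved_tflops(std::uint64_t threads,
                       std::uint64_t fma_per_thread,
                       double seconds) {
    assert(seconds > 0.0);
    const double flop = 2.0 * static_cast<double>(threads)
                            * static_cast<double>(fma_per_thread);
    return flop / seconds / 1e12;
}
```

For example, one million threads running the 512-FMA kernel in one millisecond would correspond to roughly 1.02 TFLOPS.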

Now, on to the results.

On my 16" MacBook Pro with an AMD 5500M I am getting an average of 3.6-3.8 TFLOPS with a large enough kernel execution grid, which is in line with its claimed peak of 4 TFLOPS. On the M1, however, the same kernel only yields 1.2-1.3 TFLOPS, the same as clpeak and half of what Apple is claiming. Note that the M1 has 1024 GPU ALUs, so this is consistent with 1 FLOP per ALU per clock, assuming the latter is around 1.2 GHz (which again is realistic).
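The back-of-the-envelope estimate above can be written out as a small calculation. This is only a sketch: the 1.2 GHz clock is the assumption stated in the post, not a measured value, and the helper name `peak_tflops` is hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: theoretical peak in TFLOPS for a GPU with
// `alus` ALUs at `clock_ghz` GHz, each retiring
// `flop_per_alu_per_clock` FLOP per cycle.
double peak_tflops(int alus, double clock_ghz, double flop_per_alu_per_clock) {
    assert(alus > 0 && clock_ghz > 0.0);
    // ALUs * (1e9 cycles/s) * FLOP per cycle / (1e12 FLOP per TFLOP)
    return alus * clock_ghz * flop_per_alu_per_clock / 1000.0;
}
```

With 1024 ALUs at an assumed 1.2 GHz, 1 FLOP per ALU per clock gives about 1.23 TFLOPS, matching the measurement, while 2 FLOP per ALU per clock (one FMA every cycle) gives about 2.46 TFLOPS, close to Apple's 2.6 TFLOPS figure.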

This is where things get interesting. I tried to run two interleaved FMA chains at the same time (each template invocation computes two Taylor series) and the performance jumped to 2.5-2.7 TFLOPS on the M1, with no change on the AMD Navi GPU. Running three or more chains does not improve the result further (the throughput actually decreases slightly).
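A minimal CPU-side sketch of what two interleaved chains can look like. The two FMAs in each step do not depend on each other, so hardware that can overlap independent instructions has work to fill the gaps. The function name `two_chains`, the second input `y`, and the use of a loop instead of the template recursion are assumptions made for brevity, not the original shader code.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical sketch: two independent FMA chains advanced in lockstep.
// Within one iteration, chain A and chain B share no data dependency.
template <typename T>
T two_chains(T x, T y, int n) {
    assert(n >= 0);
    T a = x;  // chain A accumulator
    T b = y;  // chain B accumulator
    for (int k = 0; k < n; ++k) {
        a = std::fma(x, x, a);  // depends only on the previous a
        b = std::fma(y, y, b);  // depends only on the previous b
    }
    return a + b;  // combine so that neither chain is dead code
}
```

Each iteration now performs 2 FMAs (4 FLOP), so the FLOP count per invocation doubles for the same chain length.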

I then tried to estimate the performance of half-precision computation. First, the results for AMD Navi, which is known to have double-rate FP16. In my initial benchmark, FP16 performance was identical to FP32 performance on the Navi. However, if I compute two or more interleaved expansions, the performance goes up to 6-7 TFLOPS. This suggests that in order to benefit from the dual-rate FP16, two FP16 operations must be issued simultaneously.

Now a big surprise: on the M1, FP16 does not seem to have any impact on performance whatsoever. The results are identical to FP32. The calculation is definitely performed with lower precision (I verified the results), but the performance does not change. I find this a bit shocking, since common knowledge says that mobile GPUs invest heavily in fast reduced-precision ALUs. I still need to test it on my iPhone. I suspect there might be a flaw in my methodology, as I have difficulty believing that Apple doesn't have faster FP16. Maybe I should try something else, like computing an FP32 and an FP16 chain simultaneously?

To summarise:

1. Peak compute performance on M1 is only reached if multiple dependency chains are interleaved. I do not know whether this means that each ALU is capable of doing two FP32 operations per clock or whether there is some sort of pipeline effect.

2. M1 seems to run FP32 and FP16 operations at the same speed, at least in my simple example. I am very sceptical about this result.

Do you have any suggestions on how I can improve this test, and what else I could try?