By: Adrian (a.delete@this.acm.org), December 5, 2020 4:00 am
Room: Moderated Discussions
K.K (anon.delete@this.anon.coom) on December 5, 2020 2:31 am wrote:
> I tried to benchmark the peak compute performance of the M1 GPU this morning (I have a MacBook
> Pro). Apple claims that the GPU is capable of 2.6 TFLOPS, yet clpeak only reports a measured
> 1.1 TFLOPS, so I decided to do my own tests. Please keep in mind that I have no experience
> with this kind of benchmark and that my approach might be extremely naive. I hope that you
> smart people can help me make more sense of the results, because I find them quite strange.
> (Apologies for the long post and the lack of spacing; the forum seems to gobble up indentation.)
>
> The basic methodology was to run a long chain of FMA operations on
> the GPU. I chose the simple geometric series expansion
>
> 1/(1-x) = sum x^k (k >= 0)
>
> because it produces meaningful results and can be trivially computed as a chain of FMA operations like this
>
> sum_0 = x
> sum_k = fma(x, sum_{k-1}, x)
>
> The Metal shader was based around the following C++ template:
>
> template <typename T, int N>
> struct taylor_series_sum {
>     static T compute(T x) {
>         return fma(x, taylor_series_sum<T, N - 1>::compute(x), x);
>     }
> };
>
> template <typename T>
> struct taylor_series_sum<T, 0> {
>     static T compute(T x) {
>         return x;
>     }
> };
>
> The compute kernel function itself is then fairly trivial. Memory reads/writes are avoided as
> much as possible; there is only a final store to prevent the kernel from being optimised out. Below
> is an example of a kernel that runs 512 FMA operations (for 1024 total FLOP per invocation):
>
> kernel void benchmark(...) {
>     out[0] = taylor_series_sum<float, 512>::compute((float) index / (float) nthreads);
> }
>
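> The reported throughput is then just total FLOPs divided by wall time, computed on the host
> along these lines (a sketch; the function name and variables are illustrative):
>
> // each kernel invocation performs 512 FMAs = 1024 FLOP
> static double measured_tflops(double nthreads, double elapsed_seconds) {
>     return nthreads * 1024.0 / (elapsed_seconds * 1e12);
> }
>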
> Now, on to the results.
>
> On my 16" MacBook Pro with an AMD 5500M I am getting an average of 3.6-3.8 TFLOPS with a large
> enough kernel execution grid, which is in line with its claimed peak of 4 TFLOPS. On the
> M1, however, the same kernel only yields 1.2-1.3 TFLOPS, the same as clpeak and half of what
> Apple is claiming. Note that the M1 has 1024 GPU ALUs, so this is consistent with 1 FLOP per
> ALU per clock, assuming the latter runs at around 1.2 GHz (which again is realistic).
>
> This is where things get interesting. I tried to run two interleaved FMA chains
> at the same time (each template invocation computes two Taylor series) and the performance
> jumped to 2.5-2.7 TFLOPS on M1 — no change on the AMD Navi GPU. Running three or more
> chains does not change the result (the throughput actually decreases slightly).
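>
> A sketch of what the two-chain variant might look like (illustrative; the pair-accumulator
> helper here is not necessarily my exact code):
>
> template <typename T, int N>
> struct taylor_series_sum2 {
>     static void compute(T x, T y, thread T &sx, thread T &sy) {
>         taylor_series_sum2<T, N - 1>::compute(x, y, sx, sy);
>         sx = fma(x, sx, x); // chain 1
>         sy = fma(y, sy, y); // chain 2, independent of chain 1
>     }
> };
>
> template <typename T>
> struct taylor_series_sum2<T, 0> {
>     static void compute(T x, T y, thread T &sx, thread T &sy) {
>         sx = x;
>         sy = y;
>     }
> };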
>
> I then tried to estimate the performance of half-precision computation. First, the AMD Navi results;
> Navi is known to have double-rate FP16. In my initial benchmark, FP16 performance was identical to FP32
> performance on the Navi. However, if I compute two or more interleaved expansions, the performance goes up to
> 6-7 TFLOPS. This suggests that in order to benefit from the dual-rate FP16, two FP16 operations must be issued
> simultaneously. Now a big surprise: on the M1, FP16 does not seem to have any impact on performance whatsoever.
> The results are identical to FP32. The calculation is definitely performed at lower precision (I verified
> the results), but the performance does not change. I find this a bit shocking, since common knowledge says
> that mobile GPUs invest heavily in fast reduced-precision ALUs. I still need to test this on my iPhone.
> I suspect there might be a flaw in my methodology, as I have difficulty believing that Apple doesn't have
> faster FP16. Maybe I should try something else, like computing an FP32 and an FP16 chain simultaneously?
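>
> Something like this, perhaps (sketch only; benchmark_mixed is a made-up name):
>
> kernel void benchmark_mixed(...) {
>     float xf = (float) index / (float) nthreads;
>     half  xh = (half) xf;
>     float sf = taylor_series_sum<float, 512>::compute(xf); // FP32 chain
>     half  sh = taylor_series_sum<half, 512>::compute(xh);  // FP16 chain
>     out[0] = sf + (float) sh; // single store keeps both chains live
> }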
>
> To summarise:
>
> 1. Peak compute performance on M1 is only reached if multiple dependency chains are
> interleaved. I do not know whether this means that each ALU is capable of doing two
> FP32 operations per clock or whether there is some sort of pipeline effect.
>
> 2. M1 seems to run FP32 and FP16 operations at the same speed, at least
> in my simple example. I am very sceptical about this result.
>
> Do you have any suggestions on how I can improve this test, and what else I could try?
>
>
As you have already said, we know that the M1 GPU has 1024 FP32 ALUs running at around 1.3 GHz.
Therefore you have reached the expected speed of around 2.6 TFLOPS at FP32, exactly like others who have published M1 benchmark results, so your test methodology should be OK.
Apple did not publish any details about their GPU, especially about what restrictions/constraints it might have, so I would not find it unusual if the latency of its FMA32 operation is 2 cycles, or if it does not have hardware to accelerate FP16 operations.
For the latency, it would be interesting to repeat your tests with simple FP32 adds and simple FP32 multiplies, to see whether they also have a 2-cycle latency or a 1-cycle latency, in which case there would be no need for interleaving to reach their full speed of about 1.3 TFLOPS.
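For example, reusing your template structure with a plain add instead of an FMA (a sketch; the add_chain name is mine):

template <typename T, int N>
struct add_chain {
    static T compute(T x) {
        // dependent chain of N adds, 1 FLOP each
        return x + add_chain<T, N - 1>::compute(x);
    }
};

template <typename T>
struct add_chain<T, 0> {
    static T compute(T x) { return x; }
};

If a single add chain also reaches full throughput only when two chains are interleaved, that would point at a pipeline effect rather than dual-issue per ALU.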
For FP16, many older GPUs implemented the operations for compatibility but executed them in the same FP32 units, at the same speed.
Apple might have additional tensor operations for FP16, like NVIDIA and now also AMD, which could allow faster speeds than FP32 for applications that can take advantage of the reduced precision. Or they might expect machine-learning applications to use their separate unit intended for that purpose (the Neural Engine) rather than the GPU, in which case there was no reason to accelerate FP16 operations in the GPU.