By: K.K (anon.delete@this.anon.com), December 6, 2020 3:46 am
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on December 5, 2020 9:39 am wrote:
> > 1. Peak compute performance on M1 is only reached if multiple dependency chains are
> > interleaved. I do not know whether this means that each ALU is capable of doing two
> > FP32 operations per clock or whether there is some sort of pipeline effect.
>
> Maybe it's like Nvidia GF104, where you need ILP to reach peak throughput? GF104
> had 3x16 FP32 units, but could only select two warps per cycle. It had to dual
> issue from one of those warps to feed all the FP32 units. See page 14 here
>
> Another possibility is for whatever reason, there weren't enough warps in flight per processing
> unit to hide execution latency. Then, ILP with multiple dependency chains made up for it.
I have now run more thorough tests, iterating through a number of possibilities based on Chester's and Adrian's suggestions. To manipulate ILP, I ran the computation using multiple interleaved dependency chains.
Here are the results in GFLOPS for FP32 (results for FP16 are identical):
            1 chain   2 chains   3 chains
Add             730       1400       1400
Add + Mul*      860       1680       1680
FMA            1400       2600       2600
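The measurement idea behind these numbers can be sketched in plain C (this is not the author's actual Metal kernel, just an illustration of what "interleaved dependency chains" means): with one chain, every FMA waits on the previous result, so the ALU's latency is exposed; with two independent chains, a pipelined ALU can start a new FMA every cycle.

```c
#include <assert.h>

#define N 1000000

/* One dependency chain: each FMA depends on the previous result,
   so throughput is limited by ALU latency. */
float one_chain(float a, float b, float x) {
    for (int i = 0; i < N; i++)
        x = a * x + b;          /* next iteration needs this x */
    return x;
}

/* Two interleaved chains: x and y are independent, so an ALU with
   latency 2 and throughput 1/cycle can overlap their FMAs. */
void two_chains(float a, float b, float *x, float *y) {
    for (int i = 0; i < N; i++) {
        *x = a * *x + b;        /* chain 1 */
        *y = a * *y + b;        /* chain 2, independent of chain 1 */
    }
}
```

Both functions perform the same number of FMAs per iteration count; only the dependency structure differs, which is exactly what the 1-chain vs. 2-chain columns above vary.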
I then tried interleaving FP32 and FP16 computations to see whether they are executed by the same ALUs. Results, using FMA chains:
chains            GFLOPS
1 FP32, 1 FP16      2600
1 FP32, 2 FP16      2000
1 FP32, 3 FP16      2000
1 FP32, 4 FP16      2000
2 FP32, 1 FP16      2600
2 FP32, 2 FP16      2600
2 FP32, 3 FP16      2600
2 FP32, 4 FP16      2600
If I understand these results correctly, the arithmetic ALU has a latency of two cycles but can start a new operation every cycle (alternatively, there are two ALUs, each with a latency of two cycles, but that seems less likely). It furthermore appears that FP16 and FP32 operations are executed on the same ALUs and at the same speed.
Finally, we observe a performance regression when a single FP32 chain is interleaved with two or more FP16 chains. My uneducated guess is that there is a penalty (of exactly one cycle) for switching between FP32 and FP16 operation, but I find it puzzling that this penalty disappears when running 2 FP32 chains + 1 FP16 chain instead. Maybe the penalty only applies when switching from FP32 to FP16 but not the other way around?