By: Chester (lamchester.delete@this.gmail.com), December 5, 2020 10:39 am
Room: Moderated Discussions
> 1. Peak compute performance on M1 is only reached if multiple dependency chains are
> interleaved. I do not know whether this means that each ALU is capable of doing two
> FP32 operations per clock or whether there is some sort of pipeline effect.
Maybe it's like Nvidia's GF104, where you need ILP to reach peak throughput? Each GF104 SM had 3x16 FP32 units but could only select two warps per cycle, so it had to dual issue from one of those warps to feed all the FP32 units. See page 14 here
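To put rough numbers on the GF104 point, here is a toy issue model. It is only a sketch: the constants come from the description above, the function names are made up for illustration, and it ignores everything else that can limit issue (register bandwidth, other unit types, and so on).

```python
# Toy model of the GF104-style issue limit; not a cycle-accurate simulator.
# Assumptions: 3 FP32 blocks of 16 lanes per SM, 2 warps selected per
# scheduler cycle, and a selected warp can issue at most 2 instructions,
# and only if it has that many independent instructions ready.
FP32_BLOCKS = 3
WARPS_SELECTED = 2
MAX_ISSUE_PER_WARP = 2

def fp32_blocks_fed(independent_instrs_per_warp):
    """Number of FP32 blocks that receive an instruction in one cycle."""
    per_warp = min(independent_instrs_per_warp, MAX_ISSUE_PER_WARP)
    return min(FP32_BLOCKS, WARPS_SELECTED * per_warp)

def utilization(independent_instrs_per_warp):
    """Fraction of FP32 blocks kept busy."""
    return fp32_blocks_fed(independent_instrs_per_warp) / FP32_BLOCKS

print(utilization(1))  # one dependency chain: only 2 of 3 blocks fed
print(utilization(2))  # two independent chains: all 3 blocks fed
```

With a single dependency chain per warp, this model tops out at two thirds of peak no matter how many warps are resident, which is the kind of behavior that would show up as "peak only with interleaved chains".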
Another possibility is that, for whatever reason, there weren't enough warps in flight per processing unit to hide execution latency. In that case, ILP from multiple dependency chains would have made up for it.
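The latency-hiding explanation can be sketched the same way. This assumes a single pipeline that issues at most one instruction per cycle and a hypothetical 4-cycle ALU latency; neither number is a measured M1 figure, and `issue_rate` is just an illustrative helper.

```python
def issue_rate(warps, chains_per_warp, latency, cycles=10_000):
    """Fraction of cycles in which the pipeline issues an instruction.

    Every dependency chain can have only one instruction in flight: its
    next instruction becomes ready `latency` cycles after the previous
    one issued. At most one instruction issues per cycle.
    """
    ready_at = [0] * (warps * chains_per_warp)
    issued = 0
    for cycle in range(cycles):
        for i, t in enumerate(ready_at):
            if t <= cycle:
                # Issue from the first ready chain; it is busy until the
                # result comes back.
                ready_at[i] = cycle + latency
                issued += 1
                break
    return issued / cycles

# Hypothetical case: 2 warps in flight, 4-cycle latency.
print(issue_rate(warps=2, chains_per_warp=1, latency=4))  # 0.5
print(issue_rate(warps=2, chains_per_warp=2, latency=4))  # 1.0
```

In this model, throughput depends only on warps times chains relative to latency, so doubling the chains per warp has the same effect as doubling the warps in flight. That makes the two explanations hard to tell apart from a throughput number alone.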