By: K.K. (anon.delete@this.anon.com), December 9, 2020 2:59 am
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on December 9, 2020 1:36 am wrote:
> A14 has double-rate FP16 ops, which is most likely achieved by splitting the FP32 ALUs in 2 FP16 ALUs.
I think it's the other way around. I believe the A14 (and other Apple mobile GPUs) have FP16 ALUs that are also capable of executing an FP32 operation, but at half the rate. That wouldn't be unprecedented in the mobile space. You are looking at this from the perspective of a typical desktop implementation, but we should not forget that Apple's GPUs evolved from the mobile design space.
An interesting observation: looking at annotated dies of the A14 and M1, the GPU on the M1 takes up much more area in proportion to the entire chip. It's almost four times larger, despite only doubling the core count. I think this supports the hypothesis that they doubled the ALU width.
> M1 has same rate FP16 ops, and that can be achieved in 2 ways, either all the FP32 ALUs also
> function in a FP16 mode, so each ALU can do either a FP32 op or a FP16 op, or half of the ALUs
> are simpler and they can do only FP32 ops and they are not used for FP16, while the other half
> of the ALUs are more complex and they can be split into a double number of FP16 ALUs.
> [...]
> The first method will yield 1.3 Tflops FP32 + 1.3 Tflops FP16, while the second method would yield
> 1.3 Tflops FP32 + 2.6 Tflops FP16, unless some other limitation prevents reaching the maximum speed.
I will test this with mixed FP32 and FP16 code and report back.
> Based on your previous results, reaching that FP32 speed might need interleaving 2 chains.
Best to forget my previous results; it turns out the shader compiler was too smart for me and managed to optimize away some of the work. The new results fix this bug, but I haven't implemented the mixed float/half computation yet.
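For what it's worth, the mixed test could look roughly like the kernel below. This is purely my own sketch, not tested code: the names (mixed_fma, iters) are made up, and the host-side dispatch and timing code is omitted. The idea is two serially dependent FMA chains, one float and one half, that are independent of each other, so FP32 and FP16 ops can overlap if the hardware allows it.

```metal
#include <metal_stdlib>
using namespace metal;

kernel void mixed_fma(device float *out    [[buffer(0)]],
                      constant uint &iters [[buffer(1)]],
                      uint tid [[thread_position_in_grid]])
{
    float a = (float)tid * 1.0001f;
    half  b = (half)(tid & 0xFF) * 0.5h;
    for (uint i = 0; i < iters; ++i) {
        // Each chain is dependent on its own previous result,
        // so neither can be reordered into parallel lanes.
        a = fma(a, 1.0000001f, 1.0f);
        b = fma(b, 0.999h, 0.5h);
    }
    // Writing both accumulators to the output buffer should keep
    // the compiler from eliminating the work, which is exactly the
    // trap I fell into last time.
    out[tid] = a + (float)b;
}
```

If the FP16 units are really split-FP32 units (or vice versa), running this against pure-float and pure-half variants of the same loop should show whether the two precisions contend for the same ALUs or run concurrently.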