By: K.K. (anon.delete@this.anon.com), December 7, 2020 2:32 am
Room: Moderated Discussions
Jeff S. (fakity.delete@this.fake.com) on December 6, 2020 10:12 pm wrote:
> This language strikes me as very odd. It certainly sounds
> to me more like packed fp16 was just dropped. If double-issue fp32 were added (supplying 6 operands
> to 2 FMA units), why not still support packed fp16 for 4 per clock per VRF port set?
>
I would hypothesize that it's the other way around, and that the A14 and other mobile GPUs simply take longer to execute FP32 operations. My tests so far suggest that both FP32 and FP16 operations on the M1 can be issued every cycle, with an execution latency of two cycles. Maybe mobile GPUs instead need four cycles for a single FP32 operation, and Apple has tweaked the M1 to bring FP32 performance up to the level of FP16? I will run some tests on an A14 later this week.
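For what it's worth, the kind of test I mean is roughly the following (a minimal sketch in Metal Shading Language; the kernel name, chain length, and constants are placeholders of my own, not anything Apple ships). A chain of dependent FMAs exposes per-operation latency, while the same work split into independent chains exposes issue rate, and comparing the two runs is what suggests the one-per-cycle issue / two-cycle latency numbers above.

#include <metal_stdlib>
using namespace metal;

// Dependent FMA chain: each fma() consumes the previous result, so total GPU
// time divided by the chain length approximates per-operation *latency*.
kernel void fp32_latency_chain(device float *out     [[buffer(0)]],
                               constant float &seed  [[buffer(1)]],
                               uint tid              [[thread_position_in_grid]])
{
    float a = seed + (float)tid;   // per-thread value so the compiler can't fold it
    const float b = 1.0000001f;
    const float c = 0.0000001f;
    for (uint i = 0; i < 4096; ++i) {
        a = fma(a, b, c);          // result feeds the next iteration (serial dependency)
    }
    out[tid] = a;                  // keep the result live so the loop isn't removed
}

// Variant with several *independent* accumulators per thread: the dependency
// chains overlap, so the limit becomes issue rate rather than latency.
kernel void fp32_throughput_chains(device float *out    [[buffer(0)]],
                                   constant float &seed [[buffer(1)]],
                                   uint tid             [[thread_position_in_grid]])
{
    float a0 = seed + (float)tid, a1 = a0 + 1.0f, a2 = a0 + 2.0f, a3 = a0 + 3.0f;
    const float b = 1.0000001f;
    const float c = 0.0000001f;
    for (uint i = 0; i < 1024; ++i) {
        a0 = fma(a0, b, c);
        a1 = fma(a1, b, c);
        a2 = fma(a2, b, c);
        a3 = fma(a3, b, c);
    }
    out[tid] = a0 + a1 + a2 + a3;
}

A half-precision version just swaps float for half; the host side times the command buffer's GPU start/end timestamps and divides by the number of FMAs executed.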