By: anon (anon.delete@this.anon.com), January 11, 2021 12:09 pm
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on January 11, 2021 4:34 am wrote:
> My theory is that this happens because both M1 and Zen 3 have a
> single 64bx64b->128 per cycle multiplier, as mentioned by you.
I think what you are saying makes sense. ARM64 doesn't have a single 64bx64b->128 multiply instruction; you are supposed to use a pair of 64bx64b multiplies, one for the low half (MUL) and one for the high half (UMULH). If I remember correctly, Andrei has estimated two integer multiply units for the current Apple CPUs, so the low and high parts together would take one cycle, i.e. one full 128-bit product per cycle. I am too lazy to check the Intel/AMD throughput tables, but I assume that their throughput is 1 such instruction per clock at best.
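To make the low/high pairing concrete, here is a minimal C sketch of a full 64bx64b->128 multiply. It assumes a GCC- or Clang-compatible compiler, since `unsigned __int128` is a compiler extension rather than standard C; the names `u128_parts` and `mul_64x64_128` are made up for illustration. On ARM64 such compilers typically lower this to a MUL/UMULH pair, while on x86-64 it typically becomes a single MUL or MULX, but check the generated assembly rather than take my word for it.

```c
#include <stdint.h>

/* Hypothetical helper type: the two 64-bit halves of a 128-bit product. */
typedef struct {
    uint64_t lo; /* low 64 bits of the product  */
    uint64_t hi; /* high 64 bits of the product */
} u128_parts;

/* Full unsigned 64x64->128 multiply.
 * Assumes the GCC/Clang `unsigned __int128` extension is available. */
static u128_parts mul_64x64_128(uint64_t a, uint64_t b)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    u128_parts r = { (uint64_t)p, (uint64_t)(p >> 64) };
    return r;
}
```

Code that needs only one half of the product pays for only one of the two multiplies on ARM64, which is part of why the split form is not necessarily a disadvantage.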