By: Adrian (a.delete@this.acm.org), January 12, 2021 12:54 am
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on January 11, 2021 12:09 pm wrote:
> Adrian (a.delete@this.acm.org) on January 11, 2021 4:34 am wrote:
> > My theory is that this happens because both M1 and Zen 3 have a
> > single 64bx64b->128 per cycle multiplier, as mentioned by you.
>
> I think what you are saying makes sense. ARM64 doesn't have a 64bx64b->128 multiply instruction,
> you are supposed to use a pair of low/high 64bx64b multiply. If I remember correctly, Andrei
> has estimated two integer multiply units for the current Apple CPUs. So it would take a cycle
> to do both low and high parts. I am too lazy to check the Intel/AMD throughput tables, but
> I assume that their throughput is 1 such instruction per clock at best.
Given the GMP results, there is no doubt that the throughput is one 128-bit product per cycle.
There are 2 ways to achieve that: either 2 independent multipliers, one for the lower half of the product and one for the upper half, work in parallel, or the pair of instructions that generates the 2 product halves is fused into one full-product multiplication.
Instruction fusion is more likely, because the hardware (a single full-length multiplier) is cheaper and it covers almost all cases in which you want both halves of the product.
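
To make the low/high pair concrete, here is a minimal C sketch of the two halves of a 64bx64b->128 product. It assumes a compiler with the `unsigned __int128` extension (GCC or Clang). The names `mul_lo`, `mul_hi`, `mul_full` and `u128_parts` are mine, for illustration only. On ARM64 such compilers typically emit MUL for the low half and UMULH for the high half. Whether the hardware fuses the pair cannot be seen from the source code.

```c
#include <assert.h>
#include <stdint.h>

/* Low 64 bits of a*b (wraps modulo 2^64); typically one MUL on ARM64. */
static inline uint64_t mul_lo(uint64_t a, uint64_t b) {
    return a * b;
}

/* High 64 bits of a*b; typically one UMULH on ARM64.
   Assumption: the compiler supports unsigned __int128. */
static inline uint64_t mul_hi(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* Hypothetical helper type holding both halves of the full product. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_parts;

/* Full 128-bit product built from the low/high pair. The assert checks
   that the pair agrees with a single wide multiplication. */
static inline u128_parts mul_full(uint64_t a, uint64_t b) {
    u128_parts p = { mul_lo(a, b), mul_hi(a, b) };
    unsigned __int128 wide = (unsigned __int128)a * b;
    assert(p.lo == (uint64_t)wide);
    assert(p.hi == (uint64_t)(wide >> 64));
    return p;
}
```

A multi-precision inner loop like the ones in GMP needs both halves for every limb product, which is why the rate of full products, and not of individual multiply instructions, is what matters.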