By: Wilco (wilco.dijkstra.delete@this.ntlworld.com), January 12, 2021 4:17 am
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on January 12, 2021 3:37 am wrote:
> none (none.delete@this.none.com) on January 12, 2021 1:15 am wrote:
> > Adrian (a.delete@this.acm.org) on January 12, 2021 12:54 am wrote:
> > [...]
> > > Given the gmp results, there is no doubt that the throughput is of one 128-bit product per cycle.
> > >
> > > There are 2 ways to achieve that, either 2 independent multipliers for the lower product
> > > and for the upper product, which can work in parallel, or a pair of instructions for
> > > generating the 2 product halves is fused into one full product multiplication.
> > >
> > > The instruction fusion is more likely, because the hardware (the single full-length multiplier)
> > > is cheaper and it covers almost all cases when you want both kinds of products.
> >
> > The low and high parts of the mul are not fused on the Apple Silicon as far as I could
> > measure. As I previously wrote that's 2 64x64b->64b (high or low part) integer muls per
> > cycle; what I forgot to clearly state is that these two muls can be completely independent
> > (meaning not hi/lo mul with the same sources).
>
>
> Thanks for the information. This is good to know.
>
> That means that in other programs, which do not use bignums, but which contain integer multiplications,
> Apple M1 is able to do twice as many multiplications per cycle as Intel or AMD or ARM Cortex-X1.
>
> So this feature, among many others, adds to the ability of M1 to have a higher IPC.
Cortex-X1 can also do 2 64-bit multiplies per cycle. The optimization guide is quite clear that it can do 2 MULH, or 1 MADD and 1 MULH, per cycle. But 2 MULH per cycle also means 2 MUL per cycle, and the AArch32 section lists 2 MUL per cycle. So basically the only limitation is that just 1 of the 2 multipliers supports accumulation.
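For anyone unsure what is being counted here, a minimal sketch in C of the two halves of a 64x64->128-bit product. It assumes GCC or Clang (for the `unsigned __int128` extension), and the helper names `mul_lo` and `mul_hi` are just illustrative. On AArch64 compilers typically turn these into a MUL and a UMULH respectively, but that depends on the compiler and flags.

```c
#include <stdint.h>

/* Low 64 bits of the 64x64 product.
 * Typically a single MUL on AArch64 (assumption: depends on the compiler). */
static inline uint64_t mul_lo(uint64_t a, uint64_t b)
{
    return a * b; /* unsigned arithmetic wraps modulo 2^64 */
}

/* High 64 bits of the 64x64 product.
 * Typically a single UMULH on AArch64 (assumption: depends on the compiler).
 * Relies on the GCC/Clang unsigned __int128 extension. */
static inline uint64_t mul_hi(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
```

The two functions are independent instructions with the same sources, so a core that issues 2 such multiplies per cycle can produce one full 128-bit product per cycle without any fusion.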
Wilco