By: Adrian (a.delete@this.acm.org), January 12, 2021 5:07 am
Room: Moderated Discussions
Wilco (wilco.dijkstra.delete@this.ntlworld.com) on January 12, 2021 4:17 am wrote:
> Adrian (a.delete@this.acm.org) on January 12, 2021 3:37 am wrote:
> > none (none.delete@this.none.com) on January 12, 2021 1:15 am wrote:
> > > Adrian (a.delete@this.acm.org) on January 12, 2021 12:54 am wrote:
> > > [...]
> > > > Given the gmp results, there is no doubt that the throughput is of one 128-bit product per cycle.
> > > >
> > > > There are 2 ways to achieve that, either 2 independent multipliers for the lower product
> > > > and for the upper product, which can work in parallel, or a pair of instructions for
> > > > generating the 2 product halves is fused into one full product multiplication.
> > > >
> > > > The instruction fusion is more likely, because the hardware (the single full-length multiplier)
> > > > is cheaper and it covers almost all cases when you want both kinds of products.
> > >
> > > The low and high parts of the mul are not fused on the Apple Silicon as far as I could
> > > measure. As I previously wrote that's 2 64x64b->64b (high or low part) integer muls per
> > > cycle; what I forgot to clearly state is that these two muls can be completely independent
> > > (meaning not hi/lo mul with the same sources).
> >
> >
> > Thanks for the information. This is good to know.
> >
> > That means that in other programs, which do not use bignums, but which contain integer multiplications,
> > Apple M1 is able to do twice as many multiplications per cycle as Intel or AMD or ARM Cortex-X1.
> >
> > So this feature, among many others, adds to the ability of M1 to have a higher IPC.
>
> Cortex-X1 can also do 2 64-bit multiplies per cycle. The optimization guide is quite clear it can do
> 2 MULH, or 1 MADD and 1 MULH per cycle. But 2 MULH also means 2 MUL, and the AArch32 section lists
> 2 MUL per cycle. So basically the limitation is only 1 of the 2 multiplies supports accumulation.
>
> Wilco
Thanks, I did not remember that.
When you pointed out previously that Cortex-A78 and Cortex-X1 have improved integer multiply throughput, I verified in their guides that they can do a full 128-bit product per cycle, but I did not pay attention to whether they can also do two independent 64-bit products per cycle, because that use case is less important for me.
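
To make the distinction concrete, here is a minimal C sketch of the two halves of a 64x64-bit product, as used in bignum code. It is only an illustration, not taken from gmp or from any vendor guide, and the names mul_lo, mul_hi and self_check are made up for this example. On AArch64 a compiler would normally turn the low half into a single MUL and the high half into a single UMULH; the high half is written here with 32-bit pieces only so that it stays portable C.

```c
#include <assert.h>
#include <stdint.h>

/* Low 64 bits of the 128-bit product a * b (wraps modulo 2^64). */
uint64_t mul_lo(uint64_t a, uint64_t b)
{
    return a * b;
}

/* High 64 bits of the 128-bit product a * b, built from four
 * 32x32->64-bit partial products so no 128-bit type is needed. */
uint64_t mul_hi(uint64_t a, uint64_t b)
{
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* contributes to bits   0..63  */
    uint64_t p1 = a_lo * b_hi;   /* contributes to bits  32..95  */
    uint64_t p2 = a_hi * b_lo;   /* contributes to bits  32..95  */
    uint64_t p3 = a_hi * b_hi;   /* contributes to bits  64..127 */

    /* Carry out of bits 32..63 into the high half; the sum of three
     * values below 2^32 cannot overflow 64 bits. */
    uint64_t carry = ((p0 >> 32) + (p1 & 0xFFFFFFFFu) + (p2 & 0xFFFFFFFFu)) >> 32;

    return p3 + (p1 >> 32) + (p2 >> 32) + carry;
}

/* A few hand-checked values. */
void self_check(void)
{
    /* (2^64 - 1)^2 = 2^128 - 2^65 + 1 */
    assert(mul_hi(UINT64_MAX, UINT64_MAX) == 0xFFFFFFFFFFFFFFFEULL);
    assert(mul_lo(UINT64_MAX, UINT64_MAX) == 1);
    /* 2^32 * 2^32 = 2^64 */
    assert(mul_hi(1ULL << 32, 1ULL << 32) == 1);
    assert(mul_lo(1ULL << 32, 1ULL << 32) == 0);
}
```

A full 128-bit product per cycle needs mul_lo and mul_hi of the same operands to issue together, while "two 64-bit multiplies per cycle" in the sense discussed above means that any two such operations, with unrelated operands, can issue in the same cycle.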