By: Adrian (a.delete@this.acm.org), January 12, 2021 3:37 am
Room: Moderated Discussions
none (none.delete@this.none.com) on January 12, 2021 1:15 am wrote:
> Adrian (a.delete@this.acm.org) on January 12, 2021 12:54 am wrote:
> [...]
> > Given the gmp results, there is no doubt that the throughput is of one 128-bit product per cycle.
> >
> > There are 2 ways to achieve that, either 2 independent multipliers for the lower product
> > and for the upper product, which can work in parallel, or a pair of instructions for
> > generating the 2 product halves is fused into one full product multiplication.
> >
> > The instruction fusion is more likely, because the hardware (the single full-length multiplier)
> > is cheaper and it covers almost all cases when you want both kinds of products.
>
> The low and high parts of the mul are not fused on the Apple Silicon as far as I could
> measure. As I previously wrote that's 2 64x64b->64b (high or low part) integer muls per
> cycle; what I forgot to clearly state is that these two muls can be completely independent
> (meaning not hi/lo mul with the same sources).
Thanks for the information. This is good to know.
That means that in other programs, which do not use bignums but do contain integer multiplications, the Apple M1 can do twice as many multiplications per cycle as Intel, AMD or ARM Cortex-X1 cores.
So this feature, among many others, contributes to the higher IPC of the M1.
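To make the two half-products concrete, here is a minimal C sketch of a full 64x64 -> 128-bit multiplication split into its low and high 64-bit parts. The function names mul_lo and mul_hi are only illustrative, and unsigned __int128 is a GCC/Clang extension rather than standard C. On AArch64 a compiler would typically turn these into one MUL and one UMULH instruction, but the exact code generated is an assumption that depends on the compiler and its options.

```c
#include <stdint.h>

/* Low 64 bits of the 64x64-bit product (illustrative name).
   Unsigned multiplication wraps modulo 2^64, so this is well defined. */
static inline uint64_t mul_lo(uint64_t a, uint64_t b)
{
    return a * b;
}

/* High 64 bits of the 64x64-bit product (illustrative name).
   Assumes the compiler supports the unsigned __int128 extension. */
static inline uint64_t mul_hi(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
```

A bignum inner loop needs both halves for the same pair of operands, while ordinary integer code mostly needs only mul_lo. With two independent multipliers, two unrelated mul_lo operations can also be issued in the same cycle, which is the case described above.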