By: none (none.delete@this.none.com), January 12, 2021 1:15 am
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on January 12, 2021 12:54 am wrote:
[...]
> Given the gmp results, there is no doubt that the throughput is of one 128-bit product per cycle.
>
> There are 2 ways to achieve that, either 2 independent multipliers for the lower product
> and for the upper product, which can work in parallel, or a pair of instructions for
> generating the 2 product halves is fused into one full product multiplication.
>
> The instruction fusion is more likely, because the hardware (the single full-length multiplier)
> is cheaper and it covers almost all cases when you want both kinds of products.
The low and high parts of the mul are not fused on Apple Silicon, as far as I could
measure. As I previously wrote, that's two 64x64b->64b (high or low part) integer muls per
cycle; what I forgot to state clearly is that these two muls can be completely independent
(that is, they need not be a hi/lo pair with the same sources).
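
For readers following along, here is a minimal C sketch of the two product halves under discussion. The helper names mul_lo and mul_hi are hypothetical, and the code assumes a compiler with the unsigned __int128 extension (GCC or Clang). On AArch64 these typically compile to one MUL and one UMULH respectively, though the exact instruction selection is up to the compiler. It only illustrates the arithmetic; it says nothing about fusion or throughput.

```c
#include <stdint.h>

/* Low 64 bits of the 64x64-bit product (typically a MUL on AArch64). */
static uint64_t mul_lo(uint64_t a, uint64_t b) {
    return a * b; /* unsigned arithmetic wraps modulo 2^64 */
}

/* High 64 bits of the 64x64-bit product (typically a UMULH on AArch64).
   Assumes the unsigned __int128 extension is available. */
static uint64_t mul_hi(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
```

A full 128-bit product calls both helpers on the same sources; the measurement above is that the two multiplies issued in one cycle do not have to be such a pair.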