The significand of IEEE binary64 is 53 bits rather than 52 (52 stored bits plus the implicit leading bit).
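As a quick check of the 53-bit figure, here is a minimal Python sketch. It assumes Python's `float` is IEEE binary64, which is true on all common platforms; `is_exact` is just an illustrative helper name.

```python
import sys

# Number of significand bits of the platform float (53 for IEEE binary64:
# 52 stored bits plus the implicit leading 1).
SIGNIFICAND_BITS = sys.float_info.mant_dig

def is_exact(n):
    """Return True if the integer n survives a round trip through float."""
    return int(float(n)) == n

print(SIGNIFICAND_BITS)
print(is_exact(2**53 - 1))  # largest 53-bit integer
print(is_exact(2**53 + 1))  # needs 54 bits, so it gets rounded
```

Every integer up to 2^53 is representable exactly, which is one bit more than the 52-bit operands that IFMA multiplies.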

That inaccessible article can be found at

The link points to a 50 MB PDF file, because it is the whole first volume of the conference proceedings, not just that article.


Unlike in the older article, where the program was run on a simulator, here they used real hardware, namely the famous Cannon Lake Intel NUC.

While that was the first device ever with IFMA, the performance on server CPUs with two FMA units should be double what they obtained on the NUC.

For their implementation, the crossover with GMP was at 1024-bit numbers, so it was useful only for numbers longer than that.

The overhead for short numbers can vary over a very wide range depending on the quality of the implementation, so I have no idea whether IFMA really needs numbers that large to be useful in general, or whether that threshold was valid only for this specific implementation.

In any case, even with a better implementation, IFMA is unlikely to be efficient for numbers shorter than 512 bits, because for short numbers it becomes increasingly difficult to keep all the multipliers busy. It might still be possible, though, to use IFMA to increase the throughput of many concurrent multiplications of shorter numbers.
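To illustrate the last point, here is a rough Python model of how several independent short multiplications could share the SIMD lanes. It is only a sketch, not the implementation from the article: `madd52lo` and `madd52hi` are hypothetical helpers that mimic the per-lane behaviour of the `VPMADD52LUQ` and `VPMADD52HUQ` instructions (add the low or high 52 bits of a 52x52-bit product to a 64-bit accumulator), and `batch_mul` and the limb-major layout are my own assumptions.

```python
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1

def madd52lo(acc, a, b):
    """Model of VPMADD52LUQ: per lane, acc + low 52 bits of a*b (mod 2^64)."""
    return [(x + (((u & MASK52) * (v & MASK52)) & MASK52)) & MASK64
            for x, u, v in zip(acc, a, b)]

def madd52hi(acc, a, b):
    """Model of VPMADD52HUQ: per lane, acc + high 52 bits of a*b (mod 2^64)."""
    return [(x + (((u & MASK52) * (v & MASK52)) >> 52)) & MASK64
            for x, u, v in zip(acc, a, b)]

def to_limbs(n, count):
    """Split n into `count` limbs of 52 bits, least significant first."""
    return [(n >> (52 * i)) & MASK52 for i in range(count)]

def batch_mul(xs, ys, limbs):
    """Schoolbook product xs[k] * ys[k] for every lane k at once.

    Limb-major layout (an assumption): X[i] is one vector holding limb i
    of every operand, so each vector instruction works on all lanes.
    """
    lanes = len(xs)
    xl = [to_limbs(x, limbs) for x in xs]
    yl = [to_limbs(y, limbs) for y in ys]
    X = [[xl[k][i] for k in range(lanes)] for i in range(limbs)]
    Y = [[yl[k][i] for k in range(lanes)] for i in range(limbs)]
    acc = [[0] * lanes for _ in range(2 * limbs)]
    for i in range(limbs):
        for j in range(limbs):
            acc[i + j] = madd52lo(acc[i + j], X[i], Y[j])
            acc[i + j + 1] = madd52hi(acc[i + j + 1], X[i], Y[j])
    # Each accumulator column receives at most 2*limbs terms below 2^52,
    # so the 12 spare bits of a 64-bit lane are enough for small limb counts.
    # Carry propagation is folded into the final recombination here.
    return [sum(acc[i][k] << (52 * i) for i in range(2 * limbs))
            for k in range(lanes)]

# Eight independent 256-bit multiplications, one per lane of a 512-bit vector.
xs = [pow(3, 100 + k, 1 << 256) for k in range(8)]
ys = [pow(5, 90 + k, 1 << 256) for k in range(8)]
products = batch_mul(xs, ys, 5)  # 5 limbs of 52 bits cover 256 bits
```

With this layout every lane does useful work even when each number has only a few limbs, at the price of having to transpose the operands into limb-major form first.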

