By: Adrian (a.delete@this.acm.org), July 27, 2022 7:25 am

Room: Moderated Discussions

Bill G (bill.delete@this.g.com) on July 26, 2022 4:25 pm wrote:

> Adrian (a.delete@this.acm.org) on July 26, 2022 2:53 am wrote:

> > Aarch64 and any other ISA must also add IFMA instructions to be

> > able to have a big number performance comparable with x86-64,

> > because there is no way to have a comparable throughput when using scalar multiplications.

>

> Why is a new IFMA instruction needed, instead of just doing 52-bit integer

> multiplication with a 64-bit floating-point FMA vector instruction?

>

> Michael S (already5chosen.delete@this.yahoo.com) on July 26, 2022 5:55 am wrote:

> > ... But it has something else - fully pipelined FP double-precision

> > division unit. Could it be used to speed up big-integer division

> > for above-mentioned lengths? My answer is "I don't know before I

> > try.". And "try" here does not mean something quick, but rather

> > a week or four of hard work of writing and comparing several alternatives.

>

> Digit based division, where a digit here would be 52 bits, is quite complicated. I can see keeping a pipelined

> division unit full when doing multiple big integer divides in parallel but I can’t see how to keep a pipelined

> division unit full for a single big integer division. This would be a good thesis project for someone.

>

> Adrian (a.delete@this.acm.org) on July 26, 2022 7:59 am wrote:

> > Unlike for multiplication, for division or square root it is less likely

> > that vector instructions could be helpful in big number computations,

> > as in the algorithms that use such instructions their operands are typically

> > dependent on earlier results. So I do not expect any future

> > ISA improvements comparable to the introduction of IFMA, but only

> > slightly faster implementations of the existing instructions.

>

> In order to use the fast multiplies provided by vector multiply instructions, what

> would you think of implementing big integer division with Newton-Raphson iterations

> followed by a fix up step at the end to get the correctly rounded answer? A Newton-Raphson

> iteration computes 1/a with two multiplications per iteration like this:

>

> X(n+1) = X(n) ( 2 - a X(n) )

Because such algorithms are frequently used for the division of big numbers, the speed of division depends on the speed of multiplication, so something like IFMA obviously improves both.

There are also alternative algorithms, which for very big numbers use also mostly multiplications, but they also use some divisions, whose execution time can become significant for relatively small numbers.

What I have said above is that while the multiplication part of the algorithm can be accelerated with vector multiplications, i.e. with IFMA, the division part will not be accelerated by vector divisions, except in the trivial case when many independent long divisions are done concurrently, so it might be possible to aggregate some independent scalar divisions into a vector division.

Therefore it is unlikely that some vector integer division instructions will ever be introduced. At most, other ISA's might introduce a complete 128-bit addition in the IFMA, as the Intel IFMA instructions can do only a 64-bit addition together with the multiplication, either for the high part or for the low part of the product.

> Adrian (a.delete@this.acm.org) on July 26, 2022 2:53 am wrote:

> > Aarch64 and any other ISA must also add IFMA instructions to be

> > able to have a big number performance comparable with x86-64,

> > because there is no way to have a comparable throughput when using scalar multiplications.

>

> Why is a new IFMA instruction needed, instead of just doing 52-bit integer

> multiplication with a 64-bit floating-point FMA vector instruction?

>

> Michael S (already5chosen.delete@this.yahoo.com) on July 26, 2022 5:55 am wrote:

> > ... But it has something else - fully pipelined FP double-precision

> > division unit. Could it be used to speed up big-integer division

> > for above-mentioned lengths? My answer is "I don't know before I

> > try.". And "try" here does not mean something quick, but rather

> > a week or four of hard work of writing and comparing several alternatives.

>

> Digit based division, where a digit here would be 52 bits, is quite complicated. I can see keeping a pipelined

> division unit full when doing multiple big integer divides in parallel but I can’t see how to keep a pipelined

> division unit full for a single big integer division. This would be a good thesis project for someone.

>

> Adrian (a.delete@this.acm.org) on July 26, 2022 7:59 am wrote:

> > Unlike for multiplication, for division or square root it is less likely

> > that vector instructions could be helpful in big number computations,

> > as in the algorithms that use such instructions their operands are typically

> > dependent on earlier results. So I do not expect any future

> > ISA improvements comparable to the introduction of IFMA, but only

> > slightly faster implementations of the existing instructions.

>

> In order to use the fast multiplies provided by vector multiply instructions, what

> would you think of implementing big integer division with Newton-Raphson iterations

> followed by a fix up step at the end to get the correctly rounded answer? A Newton-Raphson

> iteration computes 1/a with two multiplications per iteration like this:

>

> X(n+1) = X(n) ( 2 - a X(n) )

Because such algorithms are frequently used for the division of big numbers, the speed of division depends on the speed of multiplication, so something like IFMA obviously improves both.

There are also alternative algorithms, which for very big numbers use also mostly multiplications, but they also use some divisions, whose execution time can become significant for relatively small numbers.

What I have said above is that while the multiplication part of the algorithm can be accelerated with vector multiplications, i.e. with IFMA, the division part will not be accelerated by vector divisions, except in the trivial case when many independent long divisions are done concurrently, so it might be possible to aggregate some independent scalar divisions into a vector division.

Therefore it is unlikely that some vector integer division instructions will ever be introduced. At most, other ISA's might introduce a complete 128-bit addition in the IFMA, as the Intel IFMA instructions can do only a 64-bit addition together with the multiplication, either for the high part or for the low part of the product.

Topic | Posted By | Date |
---|---|---|

Yitian 710 | anonymous2 | 2021/10/20 08:57 PM |

Yitian 710 | Adrian | 2021/10/21 12:20 AM |

Yitian 710 | Wilco | 2021/10/21 03:47 AM |

Yitian 710 | Rayla | 2021/10/21 05:52 AM |

Yitian 710 | Wilco | 2021/10/21 11:59 AM |

Yitian 710 | anon2 | 2021/10/21 05:16 PM |

Yitian 710 | Wilco | 2022/07/16 12:21 PM |

Yitian 710 | Anon | 2022/07/16 08:22 PM |

Yitian 710 | Rayla | 2022/07/17 09:10 AM |

Yitian 710 | Anon | 2022/07/17 12:04 PM |

Yitian 710 | Rayla | 2022/07/17 12:08 PM |

Yitian 710 | Wilco | 2022/07/17 01:16 PM |

Yitian 710 | Anon | 2022/07/17 01:32 PM |

Yitian 710 | Wilco | 2022/07/17 02:22 PM |

Yitian 710 | Anon | 2022/07/17 02:47 PM |

Yitian 710 | Wilco | 2022/07/17 03:50 PM |

Yitian 710 | Anon | 2022/07/17 08:46 PM |

Yitian 710 | Wilco | 2022/07/18 03:01 AM |

Yitian 710 | Anon | 2022/07/19 11:21 AM |

Yitian 710 | Wilco | 2022/07/19 06:15 PM |

Yitian 710 | Anon | 2022/07/21 01:25 AM |

Yitian 710 | none | 2022/07/21 01:49 AM |

Yitian 710 | Anon | 2022/07/21 03:03 AM |

Yitian 710 | none | 2022/07/21 04:34 AM |

Yitian 710 | James | 2022/07/21 02:29 AM |

Yitian 710 | Anon | 2022/07/21 03:05 AM |

Yitian 710 | Wilco | 2022/07/21 04:31 AM |

Yitian 710 | Anon | 2022/07/21 05:17 AM |

Yitian 710 | Wilco | 2022/07/21 05:33 AM |

Yitian 710 | Anon | 2022/07/21 05:50 AM |

Yitian 710 | Wilco | 2022/07/21 06:07 AM |

Yitian 710 | Anon | 2022/07/21 06:20 AM |

Yitian 710 | Wilco | 2022/07/21 10:02 AM |

Yitian 710 | Anon | 2022/07/21 10:22 AM |

Yitian 710 | Adrian | 2022/07/17 11:09 PM |

Yitian 710 | Wilco | 2022/07/18 01:15 AM |

Yitian 710 | Adrian | 2022/07/18 02:35 AM |

Yitian 710 | Adrian | 2022/07/16 11:19 PM |

Computations on Big Integers | Bill G | 2022/07/25 10:06 PM |

Computations on Big Integers | none | 2022/07/25 11:35 PM |

x86 MUL 64x64 | Eric Fink | 2022/07/26 01:06 AM |

x86 MUL 64x64 | Adrian | 2022/07/26 02:27 AM |

x86 MUL 64x64 | none | 2022/07/26 02:38 AM |

x86 MUL 64x64 | Jörn Engel | 2022/07/26 10:17 AM |

x86 MUL 64x64 | Linus Torvalds | 2022/07/27 10:13 AM |

x86 MUL 64x64 | ⚛ | 2022/07/28 09:40 AM |

x86 MUL 64x64 | Jörn Engel | 2022/07/28 10:18 AM |

More than 3 registers per instruction | -.- | 2022/07/28 07:01 PM |

More than 3 registers per instruction | Anon | 2022/07/28 10:39 PM |

More than 3 registers per instruction | Jörn Engel | 2022/07/28 10:42 PM |

More than 3 registers per instruction | -.- | 2022/07/29 04:31 AM |

Computations on Big Integers | Bill G | 2022/07/26 01:40 AM |

Computations on Big Integers | none | 2022/07/26 02:17 AM |

Computations on Big Integers | Bill G | 2022/07/26 03:52 AM |

Computations on Big Integers | --- | 2022/07/26 09:57 AM |

Computations on Big Integers | Adrian | 2022/07/26 02:53 AM |

Computations on Big Integers | Bill G | 2022/07/26 03:39 AM |

Computations on Big Integers | Adrian | 2022/07/26 04:21 AM |

Computations on Big Integers in Apple AMX Units | Bill G | 2022/07/26 04:28 AM |

Computations on Big Integers in Apple AMX Units | Adrian | 2022/07/26 05:13 AM |

Typo | Adrian | 2022/07/26 05:20 AM |

IEEE binary64 is 53 bits rather than 52. (NT) | Michael S | 2022/07/26 05:34 AM |

IEEE binary64 is 53 bits rather than 52. | Adrian | 2022/07/26 07:32 AM |

IEEE binary64 is 53 bits rather than 52. | Michael S | 2022/07/26 10:02 AM |

IEEE binary64 is 53 bits rather than 52. | Adrian | 2022/07/27 06:58 AM |

IEEE binary64 is 53 bits rather than 52. | none | 2022/07/27 07:14 AM |

IEEE binary64 is 53 bits rather than 52. | Adrian | 2022/07/27 07:55 AM |

Thanks a lot for the link to the article! (NT) | none | 2022/07/27 08:09 AM |

Typo | zArchJon | 2022/07/26 09:51 AM |

Typo | Michael S | 2022/07/26 10:25 AM |

Typo | zArchJon | 2022/07/26 11:52 AM |

Typo | Michael S | 2022/07/26 01:02 PM |

Computations on Big Integers | Michael S | 2022/07/26 05:55 AM |

Computations on Big Integers | Adrian | 2022/07/26 07:59 AM |

IFMA and Division | Bill G | 2022/07/26 04:25 PM |

IFMA and Division | rwessel | 2022/07/26 08:16 PM |

IFMA and Division | Adrian | 2022/07/27 07:25 AM |

Computations on Big Integers | none | 2022/07/27 01:22 AM |

Big integer multiplication with vector IFMA | Bill G | 2022/07/29 01:06 AM |

Big integer multiplication with vector IFMA | Adrian | 2022/07/29 01:35 AM |

Big integer multiplication with vector IFMA | -.- | 2022/07/29 04:32 AM |

Big integer multiplication with vector IFMA | Adrian | 2022/07/29 09:47 PM |

Big integer multiplication with vector IFMA | Anon | 2022/07/30 08:12 AM |

Big integer multiplication with vector IFMA | Adrian | 2022/07/30 09:27 AM |

AVX-512 unfriendly to heter-performance cores | Paul A. Clayton | 2022/07/31 03:20 PM |

AVX-512 unfriendly to heter-performance cores | Anon | 2022/07/31 03:33 PM |

AVX-512 unfriendly to heter-performance cores | anonymou5 | 2022/07/31 05:03 PM |

AVX-512 unfriendly to heter-performance cores | Brett | 2022/07/31 07:26 PM |

AVX-512 unfriendly to heter-performance cores | Adrian | 2022/08/01 01:45 AM |

Why can't E-cores have narrow/slow AVX-512? (NT) | anonymous2 | 2022/08/01 03:37 PM |

Why can't E-cores have narrow/slow AVX-512? | Ivan | 2022/08/02 12:09 AM |

Why can't E-cores have narrow/slow AVX-512? | anonymou5 | 2022/08/02 10:13 AM |

Why can't E-cores have narrow/slow AVX-512? | Dummond D. Slow | 2022/08/02 03:02 PM |

AVX-512 unfriendly to heter-performance cores | Paul A. Clayton | 2022/08/02 01:19 PM |

AVX-512 unfriendly to heter-performance cores | Anon | 2022/08/02 09:09 PM |

AVX-512 unfriendly to heter-performance cores | Adrian | 2022/08/03 12:50 AM |

AVX-512 unfriendly to heter-performance cores | Anon | 2022/08/03 09:15 AM |

AVX-512 unfriendly to heter-performance cores | -.- | 2022/08/03 08:17 PM |

AVX-512 unfriendly to heter-performance cores | Anon | 2022/08/03 09:02 PM |

IFMA: empty promises from Intel as usual | Kent R | 2022/07/29 07:15 PM |

No hype lasts forever | Anon | 2022/07/30 08:06 AM |

Big integer multiplication with vector IFMA | me | 2022/07/30 09:15 AM |

Computations on Big Integers | --- | 2022/07/26 09:48 AM |

Computations on Big Integers | none | 2022/07/27 01:10 AM |

Computations on Big Integers | --- | 2022/07/28 11:43 AM |

Computations on Big Integers | --- | 2022/07/28 06:44 PM |

Computations on Big Integers | dmcq | 2022/07/26 02:27 PM |

Computations on Big Integers | Adrian | 2022/07/27 08:15 AM |

Computations on Big Integers | Brett | 2022/07/27 11:07 AM |

Yitian 710 | Wes Felter | 2021/10/21 12:51 PM |

Yitian 710 | Adrian | 2021/10/21 01:25 PM |

Yitian 710 | Anon | 2021/10/21 06:08 AM |

Strange definition of the word single. (NT) | anon2 | 2021/10/21 05:00 PM |

AMD Epyc uses chiplets. This is why "strange"? | Mark Roulo | 2021/10/21 05:08 PM |

AMD Epyc uses chiplets. This is why "strange"? | anon2 | 2021/10/21 05:34 PM |

Yeah. Blame spec.org, too, though! | Mark Roulo | 2021/10/21 05:58 PM |

Yeah. Blame spec.org, too, though! | anon2 | 2021/10/21 08:07 PM |

Yeah. Blame spec.org, too, though! | Björn Ragnar Björnsson | 2022/07/17 06:23 AM |

Yeah. Blame spec.org, too, though! | Rayla | 2022/07/17 09:13 AM |

Yeah. Blame spec.org, too, though! | Anon | 2022/07/17 12:01 PM |