By: Adrian (a.delete@this.acm.org), July 26, 2022 2:27 am

Room: Moderated Discussions

Eric Fink (eric.delete@this.anon.com) on July 26, 2022 1:06 am wrote:

> none (none.delete@this.none.com) on July 25, 2022 11:35 pm wrote:

> > Bill G (bill.delete@this.g.com) on July 25, 2022 10:06 pm wrote:

> > > Adrian (a.delete@this.acm.org) on July 16, 2022 11:19 pm wrote:

> > > > for the computations dominated by the use of floating-point numbers or big integer numbers,

> > > > the AMD and Intel CPUs remain without competition (except from GPUs).

> > >

> > > What is it about AMD and Intel CPUs that make them good for computations on big integers? ARM

> > > has add, subtract with carry flag instructions (ADC, SBC) and multiply, multiply-accumulate producing

> > > the full width product (MULL, MLAL). x86 has similar instructions (ADCX, ADOX, MULX).

> >

> > I'm not sure what Adrian's claim is, but Apple M1 likely is the fastest per clock CPU for GMP.

> > See https://gmplib.org/gmpbench

> >

> > What matters (beyond what you wrote) is the number of integer multiplications you can issue

> > per clock. For AArch64, you don't have 64x64->128 bit multiply instructions, so you have to

> > use more 64x64->64 bit ones (AArch64 has 64x64->64 low and high, so you need 2 such

> > instructions; a possibility, not implemented as far as I know, would be to fuse two such

> > multiplies).

>

> Isn't it also the case that the x86 64x64 MUL has some significant restrictions for the location of

> the operands? If I remember correctly, one of them has to be in EAX and the result is placed in EAX

> + EDX or something like that. Depending on your problem, this might require additional scaffolding

> to get things into place, making the advantage over ARM's two instruction approach less obvious.

>

> Besides, are modern x86 implementations doing this operation with the same latency/throughput

> as the truncated multiplication? I wouldn't be surprised if the full multiplication required

> more u-ops and were slower. Something like M1 that has more integer multiply ports can probably

> easily catch up even if the multiplication is split at the ISA level.

These restrictions on multiplication were removed from x86-64 in 2013, 9 years ago, when Intel Haswell introduced the BMI2 extension.

Full-width integer multiplications are now done with MULX, which has 4 operands (3 explicit and 1 implicit).

One source must implicitly be RDX, but the other source and the 2 destinations for the 2 halves of the 128-bit product can be any registers.

For 4 arbitrary operands, you need an extra MOV to RDX, but a register-to-register MOV is normally eliminated at register renaming in modern CPUs, so it is practically free.
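
To make the 64x64->128 building block concrete, here is a minimal sketch in portable C. It assumes a GCC- or Clang-compatible compiler, because `unsigned __int128` is a compiler extension rather than standard C. The names `mul_wide` and `mul_by_limb` are made up for this illustration. On x86-64 a compiler can typically lower the widening multiply to a single MUL or MULX, while on AArch64 it becomes a MUL plus UMULH pair; which instruction is actually chosen depends on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Full 64x64 -> 128-bit product, returned as low and high 64-bit halves.
   unsigned __int128 is a GCC/Clang extension (assumption of this sketch). */
static inline void mul_wide(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
}

/* Hypothetical helper: dst[0..n-1] = low limbs of src[0..n-1] * m,
   limbs stored least significant first. Returns the extra top limb. */
static uint64_t mul_by_limb(uint64_t *dst, const uint64_t *src, size_t n, uint64_t m)
{
    uint64_t carry = 0;
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++) {
        uint64_t lo, hi;
        mul_wide(src[i], m, &lo, &hi);
        lo += carry;
        hi += (lo < carry);  /* cannot overflow: hi is at most 2^64 - 2 */
        dst[i] = lo;
        carry = hi;
    }
    return carry;
}
```

Each loop iteration needs one full-width product plus a carry propagation, which is why the number of wide multiplications that can be issued per clock matters so much for this kind of code.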

However, since Intel Cannon Lake (2018), the CPUs that support the AVX-512 IFMA extension (Ice Lake, Tiger Lake, Rocket Lake, and the pending Sapphire Rapids and Zen 4) have a better way than MULX to do big-number multiplications, i.e. IFMA.
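
For readers who have not used IFMA, the following is a scalar model in C of what one 64-bit lane of the two IFMA instructions (VPMADD52LUQ and VPMADD52HUQ) computes. It is only an emulation for illustration, not vector code: the function names `madd52lo` and `madd52hi` are made up here, and `unsigned __int128` again assumes GCC or Clang. Each lane multiplies the low 52 bits of two operands and adds either the low or the high 52 bits of the 104-bit product to a 64-bit accumulator.

```c
#include <stdint.h>

/* Mask selecting the low 52 bits of a 64-bit lane. */
#define MASK52 ((UINT64_C(1) << 52) - 1)

/* Scalar model of one lane of VPMADD52LUQ:
   acc + low 52 bits of (a[51:0] * b[51:0]). */
static inline uint64_t madd52lo(uint64_t acc, uint64_t a, uint64_t b)
{
    unsigned __int128 p = (unsigned __int128)(a & MASK52) * (b & MASK52);
    return acc + ((uint64_t)p & MASK52);
}

/* Scalar model of one lane of VPMADD52HUQ:
   acc + high 52 bits of the same 104-bit product. */
static inline uint64_t madd52hi(uint64_t acc, uint64_t a, uint64_t b)
{
    unsigned __int128 p = (unsigned __int128)(a & MASK52) * (b & MASK52);
    return acc + (uint64_t)(p >> 52);
}
```

Because the limbs are 52 bits wide inside 64-bit lanes, the 12 spare bits of each accumulator can absorb many partial products before carries have to be propagated, and a 512-bit register holds 8 such lanes.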

Unfortunately, due to Intel's stupid handling of AVX-512, IFMA is still seldom used. That might change with Zen 4.

