By: --- (---.delete@this.redheron.com), July 26, 2022 9:48 am
Room: Moderated Discussions
none (none.delete@this.none.com) on July 25, 2022 11:35 pm wrote:
> Bill G (bill.delete@this.g.com) on July 25, 2022 10:06 pm wrote:
> > Adrian (a.delete@this.acm.org) on July 16, 2022 11:19 pm wrote:
> > > for the computations dominated by the use of floating-point numbers or big integer numbers,
> > > the AMD and Intel CPUs remain without competition (except from GPUs).
> >
> > What is it about AMD and Intel CPUs that make them good for computations on big integers? ARM
> > has add, subtract with carry flag instructions (ADC, SBC) and multiply, multiply-accumulate producing
> > the full width product (MULL, MLAL). x86 has similar instructions (ADCX, ADOX, MULX).
>
> I'm not sure what Adrian claim is, but Apple M1 likely is the fastest per clock CPU for GMP.
> See https://gmplib.org/gmpbench
The M1 number is old'ish, and I suspect that GMP on M1 is not as optimized as it could be.
The last time I (very quickly) looked at this, a performance gate was the handling of flags (carry etc) and the GMP ARMv8 code was not exploiting the fact that on M1 (I can't speak to other ARM cores) you can very rapidly move these to and from integers. I sent them an email about this, but who knows when it will become productized?
Like any new architecture, optimization takes time, and the more specialized the code, the longer it takes for all the pieces to line up -- the relevant people have to acquire the relevant CPUs, then have to become familiar with the performance characteristics, then have to figure/learn/disseminate non-obvious tricks, etc.
> What matters (beyond what you wrote) is the number of integer multiplications you can issue
> per clock. For AArch64, you don't have 64x64->128 bit multiply instructions, so you have to
> use more 64x64->64 bit ones (AArch64 has 64x64->64 low and high, so you need 2 such
> instructions; a possibility, not implemented as far as I know, would be to fuse two such
> multiplies).
> Bill G (bill.delete@this.g.com) on July 25, 2022 10:06 pm wrote:
> > Adrian (a.delete@this.acm.org) on July 16, 2022 11:19 pm wrote:
> > > for the computations dominated by the use of floating-point numbers or big integer numbers,
> > > the AMD and Intel CPUs remain without competition (except from GPUs).
> >
> > What is it about AMD and Intel CPUs that make them good for computations on big integers? ARM
> > has add, subtract with carry flag instructions (ADC, SBC) and multiply, multiply-accumulate producing
> > the full width product (MULL, MLAL). x86 has similar instructions (ADCX, ADOX, MULX).
>
> I'm not sure what Adrian claim is, but Apple M1 likely is the fastest per clock CPU for GMP.
> See https://gmplib.org/gmpbench
The M1 number is old'ish, and I suspect that GMP on M1 is not as optimized as it could be.
The last time I (very quickly) looked at this, a performance gate was the handling of flags (carry etc) and the GMP ARMv8 code was not exploiting the fact that on M1 (I can't speak to other ARM cores) you can very rapidly move these to and from integers. I sent them an email about this, but who knows when it will become productized?
Like any new architecture, optimization takes time, and the more specialized the code, the longer it takes for all the pieces to line up -- the relevant people have to acquire the relevant CPUs, then have to become familiar with the performance characteristics, then have to figure/learn/disseminate non-obvious tricks, etc.
> What matters (beyond what you wrote) is the number of integer multiplications you can issue
> per clock. For AArch64, you don't have 64x64->128 bit multiply instructions, so you have to
> use more 64x64->64 bit ones (AArch64 has 64x64->64 low and high, so you need 2 such
> instructions; a possibility, not implemented as far as I know, would be to fuse two such
> multiplies).
Topic | Posted By | Date |
---|---|---|
Yitian 710 | anonymous2 | 2021/10/20 08:57 PM |
Yitian 710 | Adrian | 2021/10/21 12:20 AM |
Yitian 710 | Wilco | 2021/10/21 03:47 AM |
Yitian 710 | Rayla | 2021/10/21 05:52 AM |
Yitian 710 | Wilco | 2021/10/21 11:59 AM |
Yitian 710 | anon2 | 2021/10/21 05:16 PM |
Yitian 710 | Wilco | 2022/07/16 12:21 PM |
Yitian 710 | Anon | 2022/07/16 08:22 PM |
Yitian 710 | Rayla | 2022/07/17 09:10 AM |
Yitian 710 | Anon | 2022/07/17 12:04 PM |
Yitian 710 | Rayla | 2022/07/17 12:08 PM |
Yitian 710 | Wilco | 2022/07/17 01:16 PM |
Yitian 710 | Anon | 2022/07/17 01:32 PM |
Yitian 710 | Wilco | 2022/07/17 02:22 PM |
Yitian 710 | Anon | 2022/07/17 02:47 PM |
Yitian 710 | Wilco | 2022/07/17 03:50 PM |
Yitian 710 | Anon | 2022/07/17 08:46 PM |
Yitian 710 | Wilco | 2022/07/18 03:01 AM |
Yitian 710 | Anon | 2022/07/19 11:21 AM |
Yitian 710 | Wilco | 2022/07/19 06:15 PM |
Yitian 710 | Anon | 2022/07/21 01:25 AM |
Yitian 710 | none | 2022/07/21 01:49 AM |
Yitian 710 | Anon | 2022/07/21 03:03 AM |
Yitian 710 | none | 2022/07/21 04:34 AM |
Yitian 710 | James | 2022/07/21 02:29 AM |
Yitian 710 | Anon | 2022/07/21 03:05 AM |
Yitian 710 | Wilco | 2022/07/21 04:31 AM |
Yitian 710 | Anon | 2022/07/21 05:17 AM |
Yitian 710 | Wilco | 2022/07/21 05:33 AM |
Yitian 710 | Anon | 2022/07/21 05:50 AM |
Yitian 710 | Wilco | 2022/07/21 06:07 AM |
Yitian 710 | Anon | 2022/07/21 06:20 AM |
Yitian 710 | Wilco | 2022/07/21 10:02 AM |
Yitian 710 | Anon | 2022/07/21 10:22 AM |
Yitian 710 | Adrian | 2022/07/17 11:09 PM |
Yitian 710 | Wilco | 2022/07/18 01:15 AM |
Yitian 710 | Adrian | 2022/07/18 02:35 AM |
Yitian 710 | Adrian | 2022/07/16 11:19 PM |
Computations on Big Integers | Bill G | 2022/07/25 10:06 PM |
Computations on Big Integers | none | 2022/07/25 11:35 PM |
x86 MUL 64x64 | Eric Fink | 2022/07/26 01:06 AM |
x86 MUL 64x64 | Adrian | 2022/07/26 02:27 AM |
x86 MUL 64x64 | none | 2022/07/26 02:38 AM |
x86 MUL 64x64 | Jörn Engel | 2022/07/26 10:17 AM |
x86 MUL 64x64 | Linus Torvalds | 2022/07/27 10:13 AM |
x86 MUL 64x64 | ⚛ | 2022/07/28 09:40 AM |
x86 MUL 64x64 | Jörn Engel | 2022/07/28 10:18 AM |
More than 3 registers per instruction | -.- | 2022/07/28 07:01 PM |
More than 3 registers per instruction | Anon | 2022/07/28 10:39 PM |
More than 3 registers per instruction | Jörn Engel | 2022/07/28 10:42 PM |
More than 3 registers per instruction | -.- | 2022/07/29 04:31 AM |
Computations on Big Integers | Bill G | 2022/07/26 01:40 AM |
Computations on Big Integers | none | 2022/07/26 02:17 AM |
Computations on Big Integers | Bill G | 2022/07/26 03:52 AM |
Computations on Big Integers | --- | 2022/07/26 09:57 AM |
Computations on Big Integers | Adrian | 2022/07/26 02:53 AM |
Computations on Big Integers | Bill G | 2022/07/26 03:39 AM |
Computations on Big Integers | Adrian | 2022/07/26 04:21 AM |
Computations on Big Integers in Apple AMX Units | Bill G | 2022/07/26 04:28 AM |
Computations on Big Integers in Apple AMX Units | Adrian | 2022/07/26 05:13 AM |
Typo | Adrian | 2022/07/26 05:20 AM |
IEEE binary64 is 53 bits rather than 52. (NT) | Michael S | 2022/07/26 05:34 AM |
IEEE binary64 is 53 bits rather than 52. | Adrian | 2022/07/26 07:32 AM |
IEEE binary64 is 53 bits rather than 52. | Michael S | 2022/07/26 10:02 AM |
IEEE binary64 is 53 bits rather than 52. | Adrian | 2022/07/27 06:58 AM |
IEEE binary64 is 53 bits rather than 52. | none | 2022/07/27 07:14 AM |
IEEE binary64 is 53 bits rather than 52. | Adrian | 2022/07/27 07:55 AM |
Thanks a lot for the link to the article! (NT) | none | 2022/07/27 08:09 AM |
Typo | zArchJon | 2022/07/26 09:51 AM |
Typo | Michael S | 2022/07/26 10:25 AM |
Typo | zArchJon | 2022/07/26 11:52 AM |
Typo | Michael S | 2022/07/26 01:02 PM |
Computations on Big Integers | Michael S | 2022/07/26 05:55 AM |
Computations on Big Integers | Adrian | 2022/07/26 07:59 AM |
IFMA and Division | Bill G | 2022/07/26 04:25 PM |
IFMA and Division | rwessel | 2022/07/26 08:16 PM |
IFMA and Division | Adrian | 2022/07/27 07:25 AM |
Computations on Big Integers | none | 2022/07/27 01:22 AM |
Big integer multiplication with vector IFMA | Bill G | 2022/07/29 01:06 AM |
Big integer multiplication with vector IFMA | Adrian | 2022/07/29 01:35 AM |
Big integer multiplication with vector IFMA | -.- | 2022/07/29 04:32 AM |
Big integer multiplication with vector IFMA | Adrian | 2022/07/29 09:47 PM |
Big integer multiplication with vector IFMA | Anon | 2022/07/30 08:12 AM |
Big integer multiplication with vector IFMA | Adrian | 2022/07/30 09:27 AM |
AVX-512 unfriendly to heter-performance cores | Paul A. Clayton | 2022/07/31 03:20 PM |
AVX-512 unfriendly to heter-performance cores | Anon | 2022/07/31 03:33 PM |
AVX-512 unfriendly to heter-performance cores | anonymou5 | 2022/07/31 05:03 PM |
AVX-512 unfriendly to heter-performance cores | Brett | 2022/07/31 07:26 PM |
AVX-512 unfriendly to heter-performance cores | Adrian | 2022/08/01 01:45 AM |
Why can't E-cores have narrow/slow AVX-512? (NT) | anonymous2 | 2022/08/01 03:37 PM |
Why can't E-cores have narrow/slow AVX-512? | Ivan | 2022/08/02 12:09 AM |
Why can't E-cores have narrow/slow AVX-512? | anonymou5 | 2022/08/02 10:13 AM |
Why can't E-cores have narrow/slow AVX-512? | Dummond D. Slow | 2022/08/02 03:02 PM |
AVX-512 unfriendly to heter-performance cores | Paul A. Clayton | 2022/08/02 01:19 PM |
AVX-512 unfriendly to heter-performance cores | Anon | 2022/08/02 09:09 PM |
AVX-512 unfriendly to heter-performance cores | Adrian | 2022/08/03 12:50 AM |
AVX-512 unfriendly to heter-performance cores | Anon | 2022/08/03 09:15 AM |
AVX-512 unfriendly to heter-performance cores | -.- | 2022/08/03 08:17 PM |
AVX-512 unfriendly to heter-performance cores | Anon | 2022/08/03 09:02 PM |
IFMA: empty promises from Intel as usual | Kent R | 2022/07/29 07:15 PM |
No hype lasts forever | Anon | 2022/07/30 08:06 AM |
Big integer multiplication with vector IFMA | me | 2022/07/30 09:15 AM |
Computations on Big Integers | --- | 2022/07/26 09:48 AM |
Computations on Big Integers | none | 2022/07/27 01:10 AM |
Computations on Big Integers | --- | 2022/07/28 11:43 AM |
Computations on Big Integers | --- | 2022/07/28 06:44 PM |
Computations on Big Integers | dmcq | 2022/07/26 02:27 PM |
Computations on Big Integers | Adrian | 2022/07/27 08:15 AM |
Computations on Big Integers | Brett | 2022/07/27 11:07 AM |
Yitian 710 | Wes Felter | 2021/10/21 12:51 PM |
Yitian 710 | Adrian | 2021/10/21 01:25 PM |
Yitian 710 | Anon | 2021/10/21 06:08 AM |
Strange definition of the word single. (NT) | anon2 | 2021/10/21 05:00 PM |
AMD Epyc uses chiplets. This is why "strange"? | Mark Roulo | 2021/10/21 05:08 PM |
AMD Epyc uses chiplets. This is why "strange"? | anon2 | 2021/10/21 05:34 PM |
Yeah. Blame spec.org, too, though! | Mark Roulo | 2021/10/21 05:58 PM |
Yeah. Blame spec.org, too, though! | anon2 | 2021/10/21 08:07 PM |
Yeah. Blame spec.org, too, though! | Björn Ragnar Björnsson | 2022/07/17 06:23 AM |
Yeah. Blame spec.org, too, though! | Rayla | 2022/07/17 09:13 AM |
Yeah. Blame spec.org, too, though! | Anon | 2022/07/17 12:01 PM |