Computations on Big Integers

By: --- (---.delete@this.redheron.com), July 28, 2022 6:44 pm
Room: Moderated Discussions
--- (---.delete@this.redheron.com) on July 28, 2022 11:43 am wrote:
> none (none.delete@this.none.com) on July 27, 2022 1:10 am wrote:
> > --- (---.delete@this.redheron.com) on July 26, 2022 9:48 am wrote:
> > > none (none.delete@this.none.com) on July 25, 2022 11:35 pm wrote:
> > > > Bill G (bill.delete@this.g.com) on July 25, 2022 10:06 pm wrote:
> > > > > Adrian (a.delete@this.acm.org) on July 16, 2022 11:19 pm wrote:
> > > > > > for the computations dominated by the use of floating-point numbers or big integer numbers,
> > > > > > the AMD and Intel CPUs remain without competition (except from GPUs).
> > > > >
> > > > > What is it about AMD and Intel CPUs that make them good for computations on big integers? ARM
> > > > > has add, subtract with carry flag instructions (ADC, SBC) and multiply, multiply-accumulate producing
> > > > > the full width product (MULL, MLAL). x86 has similar instructions (ADCX, ADOX, MULX).
> > > >
> > > > I'm not sure what Adrian claim is, but Apple M1 likely is the fastest per clock CPU for GMP.
> > > > See https://gmplib.org/gmpbench
> > >
> > > The M1 number is old'ish, and I suspect that GMP on M1 is not as optimized as it could be.
> > > The last time I (very quickly) looked at this, a performance gate was the handling of
> > > flags (carry etc) and the GMP ARMv8 code was not exploiting the fact that on M1 (I can't
> > > speak to other ARM cores) you can very rapidly move these to and from integers. I sent
> > > them an email about this, but who knows when it will become productized?
> >
> > Can you show that code? Or did you just point out some potential improvement and are waiting
> > for them to do the development?
> >
> > For those interested (Maynard already knows that code I'm sure), here is the code:
> > https://gmplib.org/repo/gmp/file/tip/mpn/arm64/applem1/aorsmul_1.asm
> > The critical loop (for longer integers) starts at line 132. 4 64-bit limbs are processed per
> > iteration; this requires 8 muls (as I previously mentioned, one to get lower 64-bit and one
> > to get higher 64-bit per limb). The loop runs at 1.25 cycle per limb. The maximum achievable
> > given the integer multiplication resources of M1 (2 multiplications per cycle, no fusion)
> > would be 1 cycle per limb. I'm curious to see how this could be achieved.
>
> This is better than the code I saw when I looked at gmp generally a long
> time ago; in particular the CSINC is a nicer way to get at the carry.
> But the big problem is that you're not just throttled by multiplies, you're killed
> by overhead (which is hard to avoid if you want to keep the code manageable!).
>
> Essentially [with a little handwaving] each 64x64 product requires
> - one load pair (the two 64b values to be multiplied)
> - two multiplies
> - one load pair (the earlier 128bit product they will be added to)
> - two adds (of the new 128b with the earlier 128b)
> - one extraction of the carry from the two adds
> - store pair of the two adds
> - add in of the carry into the next round
>
> - then we have two instructions of loop control (amortized over loop unrolling)
> [actually Apple fuses sub+cbnz, so yay, that!]
>
> three load/stores
> two multiplies
> four arithmetics
> now we're already at 9 ops in just basic bandwidth. It's pretty damned amazing
> what superscalar OoO can do to convert that into just over one cycle!
>
> It's unclear how fiddling at the margins can alter the basic facts I've given. BUT
>
> The one thing you could do (unless I'm missing something) is
> propagate the carry IMPLICITLY from one loop to the next:
> Look at the main loop L(top), and how the variable CY works.
> Assume we ignore CY and start with carry appropriately primed.
> Then we execute line 144 as an ADDSUBC, and drop line 149;
> also drop lines 143 and 153 (fudges to access the high carry)
> and the carry from the previous round just propagates into the next round.
> I think you can rearrange the adds to get the carries in the correct order without ever having to
> extract them, and that gets us from 9 to 7 ops in the primary loop and may get us to 1 or so!...
>
> This may be tricky-to-impossible to do on x86 (with too many instructions, specifically the
> loop counter modifying the flags), but it seems feasible on ARMv8 (and similar, like POWER).

My bad. You can't get rid of all of the carry overhead, but you can get rid of one of the two. It's easier to see the pattern with L(b10):
- line 123 (generically) becomes an ADDSUBC, consuming the carry from the previous round
- line 125 IS required, to capture one of the carries
- line 128 is NOT required (the carry wraps around into the next iteration)

Then apply the same logic to the various other cases. This removes one instruction from the loop, which ain't great but is better than nothing!

Of course the other alternative (much more painful...) is to group bits by 26 and go down the vector-of-doubles path.
As an initial experiment one could use just doubles -- you still get four multipliers, but of course handling the carries is more painful.
There may be scope for a kinda bizarre "midnum". What if you start by distributing the bits 13 per element (16 may be easier in terms of extraction/shuffling), then as you perform the adds you simply allow the carries to accumulate, so after one round the maximum size of an element may be 17 bits, then 18, then 19...
Allow this to proceed up to 26 bits, then do the necessary shifting and packing to return to 13 (or 16) bits per element, and repeat.
The win here is that you keep the same multiply throughput (either two muls each generating 64 bits, or four each generating 32 bits -- with elements growing each loop, up to 52 bits), and you don't have to worry about the carry overhead except every ten or so iterations.

It seems that this has been thought about:
http://www.numberworld.org/y-cruncher/internals/representation.html#considerations
(what I am suggesting appears to be what he calls Redundant Representations),
but the tradeoffs favoring this may work out better on M1 than on x86. In particular I'm hoping that the overhead may be small enough that we can use this intermediate representation purely for the multiplies, so that we only have to "convert" (which may just be the appropriate-width loads, or some loads followed by extracts/permutes) on the way in, and similarly on the way out.
But honestly this is something I've not spent much time thinking about, so I'll defer to the experience of others.