because per-core perf has stagnated

By: Michael S, May 20, 2019 6:15 am
Room: Moderated Discussions
Adrian on May 20, 2019 4:28 am wrote:
> Adrian on May 20, 2019 3:12 am wrote:
> > Michael S on May 20, 2019 1:48 am wrote:
> > >
> > > BMI and ADX are minor improvements, but AVX tends to either not help at all or help rather significantly.
> > >
> >
> > Most applications use large integer arithmetic only for cryptography and that takes
> > a small part of the execution time so the effect of BMI2 & ADX is minor indeed.
> >
> >
> > For niche applications with a lot of large integer arithmetic the effect is major.
> >
> >
> > IIRC, BMI2 has reduced the GMP multiplication times from
> > more than 3 cycles per limb to less than 2 cycles per
> > limb and ADX reduced it a little more, so I believe that the total speedup was by around a factor of 2.
> >
> >
> >
> I want to add, for those not familiar with these Intel instructions, what their purpose is.
> Sandy Bridge was the first Intel CPU that had an integer multiplier faster than those available on
> AMD CPUs, i.e. an integer multiplier capable of a throughput of one multiplication per cycle.
> From 2003 until 2011, the older AMD CPUs had integer multipliers
> with a maximum throughput of one multiplication per 2 cycles.
> Immediately after Sandy Bridge, which was the culmination of several generations of Intel CPUs that had faster
> and faster integer multipliers, in order to match or exceed the AMD performance, AMD shot themselves in the
> foot by launching Bulldozer, whose integer multipliers were half as fast as those in the previous AMD CPUs.
> Nevertheless, it was soon discovered that in most applications on Sandy Bridge it was impossible to reach
> the maximum throughput of the integer multiplier, because the legacy multiplication instructions, with
> fixed destinations and which clobbered the flags, required a lot of additional instructions for moving
> or copying data, so the achieved throughput was much lower than the capabilities of the multiplier.
> When the new instructions included in the BMI2 (Haswell) & ADX (Broadwell) groups are
> used for large integer multiplication, instead of the legacy instructions, then it is
> easy to fully utilize the integer multiplier, achieving its maximal throughput.

The funny thing here is that microarchitectural improvements introduced more or less simultaneously with these architecture extensions, and even more so in Skylake, made the use of these extensions by SW less beneficial. Examples are more generic handling of register moves at the decoder and renamer, as well as an increased number of rename registers for the arithmetic flags.
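
For those who have not seen the inner loop under discussion, here is a minimal portable sketch of a limb-by-limb multiply-accumulate and the schoolbook multiplication built on it. It is not GMP's code: the names limb_t, addmul_1 and mul_basecase are hypothetical, and unsigned __int128 assumes GCC or Clang on a 64-bit target. Each iteration needs one 64x64->128 multiplication and two carry-propagating additions. The legacy MUL writes its result to RDX:RAX and overwrites the flags, so the carries have to be saved around it. MULX takes explicit destinations and leaves the flags alone, and ADCX/ADOX carry through CF and OF respectively, so two carry chains can be kept in flight at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; unsigned __int128 is a GCC/Clang extension. */
typedef uint64_t limb_t;

/* rp[0..n-1] += ap[0..n-1] * b; returns the carry-out limb. */
limb_t addmul_1(limb_t *rp, const limb_t *ap, size_t n, limb_t b)
{
    assert(rp != NULL && ap != NULL);
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Full 64x64 -> 128-bit product: the work done by MUL or MULX. */
        unsigned __int128 t = (unsigned __int128)ap[i] * b;
        /* Two additions that can each carry out. The sum cannot overflow
           128 bits: (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1. */
        t += rp[i];
        t += carry;
        rp[i] = (limb_t)t;          /* low half stays in place      */
        carry = (limb_t)(t >> 64);  /* high half feeds the next limb */
    }
    return carry;
}

/* Schoolbook product: rp[0..an+bn-1] = ap[0..an-1] * bp[0..bn-1].
   rp must not overlap ap or bp. */
void mul_basecase(limb_t *rp, const limb_t *ap, size_t an,
                  const limb_t *bp, size_t bn)
{
    for (size_t i = 0; i < an + bn; i++)
        rp[i] = 0;
    for (size_t j = 0; j < bn; j++)
        rp[an + j] = addmul_1(rp + j, ap, an, bp[j]);
}
```

Whether a compiler turns this loop into MULX/ADCX/ADOX depends on the compiler and its flags. Hand-written assembly or intrinsics are the usual route when the last fraction of a cycle per limb matters.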
