FP Adders are not so cheap compared to FP multipliers

By: Björn Ragnar Björnsson (bjorn.ragnar.delete@this.gmail.com), November 10, 2022 12:10 am
Room: Moderated Discussions
Heikki Kultala (heikki.kultala.delete@this.gmail.com) on November 9, 2022 8:07 am wrote:
> Björn Ragnar Björnsson (bjorn.ragnar.delete@this.gmail.com) on November 8, 2022 8:24 pm wrote:
>
> > Indeed they do not scale, so I would like to remind the folks in this discussion of the
> > fact AMD did something special for full width (512 bits) shuffle. Alexander Yee tested Zen4
> > AVX-512 for AMD shortly before Zen4 release and came to the conclusion that Zen4 can do
> > full width shuffles at 1/cycle. His guess is that Zen4 has two shuffle units, one 256 bits
> > and one 512 bits, the bigger one being able to function as two 256 bit shuffle units.
> >
> > Additionally, Alexander has a small "Editorial comment" where he preemptively reinforces Linus' points:
> >
> > "In my opinion, Intel's mistake with AVX512 is to optimize for the 100% FMA workloads (namely
> > Linpack) instead of the more common mixed FADD/FMA workloads. Adders are cheap. Multipliers
> > are expensive. One of each would do just fine for most workloads. Instead, Intel decided to
> > add a 2nd FMA to Skylake X/SP... It is that 2nd FMA which caused most of the power/throttling
> > issues that has tainted AVX512's reputation and hindered its adoption."
>
> FP adders are not so much cheaper than FP multipliers, in some cases
> they can even be more expensive than standalone FP multipliers.
>
> The cost is not in the calculation itself, but in the alignment of operands and normalization in the end.
>
> FP multiplication does not need alignment for inputs, but FP addition requires alignment for inputs.
>
> This cost of operand alignment and normalization is a problem especially in CPUs where low latency for
> operations is desired. The optimizations that reduce this latency are very expensive in area and power.
>
> However, FMA requires a wider adder than what is required for a standalone adder, which also
> makes the normalization wider, so FMA is still always much more expensive than an adder.

Ahhh, but Alexander specifically criticized the addition of a second 512-bit FMA.
In his view, that is what broke the camel's back. Streaming non-blocking FMAs without
memory bandwidth constraints will throttle the processor. But still, adders are
cheaper where it counts: alignment (argument value alignment) will be a minor
factor in nearly all of the instances where users have an inkling of numerical analysis
and digital computation, i.e. doing something sensible or "one off".
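
For anyone who wants the alignment step spelled out, here is a rough Python sketch of what Heikki describes: multiplication works on the two significands directly, while addition first has to bring both operands to a common exponent. The names (MANT_BITS, decompose, toy_add, toy_mul) are made up for illustration, it only handles positive normal doubles, and it cheats by using exact Python integers (shifting the larger operand left) where hardware would shift the smaller operand right and keep guard/sticky bits. It says nothing about actual area or power.

```python
import math

MANT_BITS = 53  # double-precision significand width, hidden bit included


def decompose(x):
    """Split a positive normal float into (sig, exp) with x == sig * 2**exp."""
    m, e = math.frexp(x)               # x == m * 2**e, with 0.5 <= m < 1
    sig = int(m * (1 << MANT_BITS))    # exact: scaling by a power of two
    return sig, e - MANT_BITS


def toy_mul(a, b):
    """Multiply: no alignment, just multiply significands and add exponents."""
    sa, ea = decompose(a)
    sb, eb = decompose(b)
    product = sa * sb                  # up to 106 bits wide before rounding
    # float() rounds the wide product once; ldexp then applies the exponent.
    return math.ldexp(float(product), ea + eb)


def toy_add(a, b):
    """Add: the operands must be aligned to a common exponent first.

    Returns (sum, shift), where shift is the alignment distance in bits.
    """
    sa, ea = decompose(a)
    sb, eb = decompose(b)
    shift = abs(ea - eb)               # data-dependent alignment distance
    if ea >= eb:
        total, e = (sa << shift) + sb, eb
    else:
        total, e = sa + (sb << shift), ea
    # float() plays the role of the final normalize-and-round step.
    return math.ldexp(float(total), e), shift


if __name__ == "__main__":
    print(toy_add(1.5, 2.25))          # (3.75, 1): operands one bit apart
    print(toy_add(1e16, 1.0))          # large alignment distance
    print(toy_mul(1.5, 2.25))
```

The shift value returned by toy_add is the data-dependent part: a multiplier never needs it, an adder always has to be able to handle it, and an FMA has to align the addend against the full double-width product.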

So, as far as I can tell, we are all in agreement re adders, multipliers, FMAs
and what sort of analysis should govern our silicon resource allocation, as if
anyone would deem us worthy of making such a choice (Linus, mildly exempted).