By: Heikki Kultala (heikki.kultala.delete@this.gmail.com), November 9, 2022 9:07 am
Room: Moderated Discussions
Björn Ragnar Björnsson (bjorn.ragnar.delete@this.gmail.com) on November 8, 2022 8:24 pm wrote:
> Indeed they do not scale, so I would like to remind the folks in this discussion of the
> fact AMD did something special for full width (512 bits) shuffle. Alexander Yee tested Zen4
> AVX-512 for AMD shortly before Zen4 release and came to the conclusion that Zen4 can do
> full width shuffles at 1/cycle. His guess is that Zen4 has two shuffle units, one 256 bits
> and one 512 bit, the bigger one being able to function as 2 256 bit shuffle units.
>
> Additionally, Alexander has a small "Editorial comment" where he preemptively reinforces Linus' points:
>
> "In my opinion, Intel's mistake with AVX512 is to optimize for the 100% FMA workloads (namely
> Linpack) instead of the more common mixed FADD/FMA workloads. Adders are cheap. Multipliers
> are expensive. One of each would do just fine for most workloads. Instead, Intel decided to
> add a 2nd FMA to Skylake X/SP... It is that 2nd FMA which caused most of the power/throttling
> issues that has tainted AVX512's reputation and hindered its adoption."
FP adders are not that much cheaper than FP multipliers; in some cases they can even be more expensive than standalone FP multipliers.
The cost is not in the calculation itself, but in aligning the operands beforehand and normalizing the result at the end.
FP multiplication does not need its inputs aligned, but FP addition does.
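To make the alignment point concrete, here is a minimal software sketch, not a model of any real hardware. The helper names (`decompose`, `mul_unrounded`, `add_unrounded`, `value`) are made up for this illustration. It handles positive finite doubles only and skips rounding. Real adders shift the smaller operand right and keep guard/sticky bits; the sketch shifts the larger one left instead so the result stays exact and easy to check.

```python
import math
from fractions import Fraction

MANT_BITS = 53  # double-precision significand width, hidden bit included

def decompose(x):
    """Split a positive finite float into (sig, exp) with x == sig * 2**exp."""
    frac, exp = math.frexp(x)            # x == frac * 2**exp, 0.5 <= frac < 1
    sig = int(frac * (1 << MANT_BITS))   # exact: scaling by a power of two
    return sig, exp - MANT_BITS

def value(sig, exp):
    """Exact rational value of sig * 2**exp."""
    return Fraction(sig) * Fraction(2) ** exp

def mul_unrounded(a, b):
    """Multiply: significands multiply, exponents add. No shift is needed."""
    sa, ea = decompose(a)
    sb, eb = decompose(b)
    return sa * sb, ea + eb

def add_unrounded(a, b):
    """Add: the operands must first be brought to a common exponent."""
    sa, ea = decompose(a)
    sb, eb = decompose(b)
    shift = abs(ea - eb)                 # data-dependent alignment distance
    if ea >= eb:
        sa <<= shift                     # align a to b's exponent
        exp = eb
    else:
        sb <<= shift                     # align b to a's exponent
        exp = ea
    return sa + sb, exp, shift
```

The shift distance in `add_unrounded` depends on the exponent difference of the inputs, which is the variable shifter that a hardware adder has to place in front of the actual addition; `mul_unrounded` has no such step.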
The cost of operand alignment and normalization is a problem especially in CPUs where low latency is desired. The optimizations that reduce the latency of these steps are very expensive in area and power.
However, an FMA requires a wider adder than a standalone adder does, which also makes the normalization wider, so an FMA is still always much more expensive than an adder.
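As a rough illustration of why the FMA's adder is wider, the sketch below only counts significand bits for double precision. The function names are hypothetical, and it says nothing about actual area or power. The point is that a fused operation adds against the full, unrounded product, which is about twice as wide as either input.

```python
import math

MANT_BITS = 53  # double-precision significand width, hidden bit included

def significand(x):
    """Integer significand of a positive finite float (exact for doubles)."""
    frac, _ = math.frexp(x)              # 0.5 <= frac < 1
    return int(frac * (1 << MANT_BITS))

def adder_operand_bits(a, b):
    """Width of the widest significand a standalone adder receives."""
    return max(significand(a).bit_length(), significand(b).bit_length())

def fma_product_bits(a, b):
    """Width of the unrounded product that the FMA's adder must accept."""
    return (significand(a) * significand(b)).bit_length()

# Largest double below 2.0: its significand is all ones (2**53 - 1).
x = 2.0 - 2.0 ** -52
standalone_bits = adder_operand_bits(x, x)
fused_bits = fma_product_bits(x, x)
```

With these inputs the standalone adder sees 53-bit significands, while the unrounded product going into the fused add is 106 bits wide, before any room for aligning the addend against it is counted.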