By: Mark Roulo (nothanks.delete@this.xxx.com), November 4, 2022 8:34 pm

Room: Moderated Discussions

Jeffrey Bosboom (firstinitiallastname.delete@this.firstnamelastname.com) on November 4, 2022 6:18 pm wrote:

> From a recent Agner Fog forum post:

>

> The support for the new AVX512 instructions is quite good, and it includes many of the extra subsets of

> AVX512. Here, I have to correct a common misunderstanding. The Zen 4 does not execute a 512-bit vector

> instruction by using a 256-bit execution unit twice, but by using two 256-bit units simultaneously. It

> does not split a 512-bit instruction into two 256-bit micro-operations, like the Zen 1 that splits 256-bit

> instructions into two 128-bit micro-operations. The Zen 4 has four 256-bit execution units. Two of these

> units can do floating point addition, and the other two can do floating point multiplication. All four

> can do integer vector addition etc. This gives a maximum throughput for 512-bit vectors of one floating

> point vector multiplication and one floating point vector addition, or two integer vector additions, per

> clock cycle. This throughput is doubled for vectors of 256 bits or less. It is still advantageous to use

> 512-bit instructions if the throughput is limited by instruction decoding or micro-operation queues or

> code cache or something else. It is rare that execution unit throughput is the bottleneck.

>

> I understand how cracking a 2n-bit instruction into two n-bit instructions and executing them sequentially

> saves area compared to a full 2n-bit-wide unit. But what is the difference between one 2n-bit unit and

> two n-bit units that can execute a 2n-bit instruction at full rate when paired? Or from the other direction,

> why wouldn't a full 2n-bit unit also be designed to execute two n-bit instructions simultaneously?

Is your question: why would a CPU not allow two independent 256-bit vector instructions to execute simultaneously in the top and bottom halves of a 512-bit vector unit?
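
To make the arithmetic in the quoted Agner Fog passage concrete, here is a minimal sketch of the throughput model it describes. The unit counts and the 256-bit unit width come from the quote; the function names, and the simplifying assumption that a 512-bit instruction just occupies two 256-bit units for one cycle, are mine for illustration and say nothing about how the hardware actually schedules.

```python
# Toy peak-throughput model of the Zen 4 vector units as described in the
# quoted post. Not a simulator: it ignores decode, ports, latency, and memory.

UNIT_WIDTH_BITS = 256

# Number of 256-bit units able to perform each operation (from the quote):
# two do FP add, two do FP multiply, all four do integer vector add.
UNITS = {"fadd": 2, "fmul": 2, "iadd": 4}


def ops_per_cycle(op: str, vector_bits: int) -> int:
    """Peak instructions per cycle for `op` at the given vector width.

    Assumption: an instruction wider than one unit occupies several units
    simultaneously; narrower instructions still occupy one whole unit.
    """
    units_needed = max(1, vector_bits // UNIT_WIDTH_BITS)
    return UNITS[op] // units_needed


def bits_per_cycle(op: str, vector_bits: int) -> int:
    """Peak data bits processed per cycle for `op` at the given width."""
    return ops_per_cycle(op, vector_bits) * vector_bits


if __name__ == "__main__":
    for width in (128, 256, 512):
        for op in ("fadd", "fmul", "iadd"):
            print(width, op, ops_per_cycle(op, width), bits_per_cycle(op, width))
```

Under this model the instruction rate halves going from 256-bit to 512-bit vectors, but the bits processed per cycle stay the same, which matches the quoted remark that 512-bit instructions mainly help when something other than the execution units is the limit.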