By: Mark Roulo (nothanks.delete@this.xxx.com), November 4, 2022 8:34 pm
Room: Moderated Discussions
Jeffrey Bosboom (firstinitiallastname.delete@this.firstnamelastname.com) on November 4, 2022 6:18 pm wrote:
> From a recent Agner Fog forum post:
>
>
> The support for the new AVX512 instructions is quite good, and it includes many of the extra subsets of
> AVX512. Here, I have to correct a common misunderstanding. The Zen 4 does not execute a 512-bit vector
> instruction by using a 256-bit execution unit twice, but by using two 256-bit units simultaneously. It
> does not split a 512-bit instruction into two 256-bit micro-operations, like the Zen 1 that splits 256-bit
> instructions into two 128-bit micro-operations. The Zen 4 has four 256-bit execution units. Two of these
> units can do floating point addition, and the other two can do floating point multiplication. All four
> can do integer vector addition etc. This gives a maximum throughput for 512-bit vectors of one floating
> point vector multiplication and one floating point vector addition, or two integer vector additions, per
> clock cycle. This throughput is doubled for vectors of 256 bits or less. It is still advantageous to use
> 512-bit instructions if the throughput is limited by instruction decoding or micro-operation queues or
> code cache or something else. It is rare that execution unit throughput is the bottleneck.
>
> I understand how cracking a 2n-bit instruction into two n-bit instructions and executing them sequentially
> saves area compared to a full 2n-bit-wide unit. But what is the difference between one 2n-bit unit and
> two n-bit units that can execute a 2n-bit instruction at full rate when paired? Or from the other direction,
> why wouldn't a full 2n-bit unit also be designed to execute two n-bit instructions simultaneously?
Is your question: Why would a CPU not allow two independent 256-bit vector instructions to execute simultaneously in the top and bottom halves of a 512-bit vector unit?
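
To make the numbers in the quoted post concrete, here is a toy model of the peak throughput it describes. The unit counts (four 256-bit units, two capable of FP add, two of FP multiply, all four of integer add) are taken from the quote. The function names and the rule that a 512-bit instruction simply occupies two units for one cycle are my own simplification for illustration, not a description of how the scheduler actually works.

```python
# Toy model of the Zen 4 vector throughput figures in the quoted post.
# Unit counts come from the quote; the names and the "a 512-bit instruction
# occupies two 256-bit units at once" rule are illustrative assumptions.

UNIT_WIDTH_BITS = 256
UNITS = {"fadd": 2, "fmul": 2, "int_add": 4}  # units able to run each op


def instructions_per_cycle(op: str, vector_bits: int) -> int:
    """Peak instructions per cycle for one op type at a given vector width."""
    # Instructions of 256 bits or less take one unit; 512-bit ones take two.
    units_per_instruction = max(1, vector_bits // UNIT_WIDTH_BITS)
    return UNITS[op] // units_per_instruction


def bits_per_cycle(op: str, vector_bits: int) -> int:
    """Peak data bits processed per cycle for one op type."""
    return instructions_per_cycle(op, vector_bits) * vector_bits


if __name__ == "__main__":
    for op in UNITS:
        for width in (128, 256, 512):
            print(op, width, instructions_per_cycle(op, width),
                  bits_per_cycle(op, width))
```

In this model the instruction rate halves when going from 256-bit to 512-bit vectors, but the data processed per cycle stays the same, so the benefit of 512-bit instructions comes from needing fewer instructions for the same work, as the quote says about decode and queue limits.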