By: Jeffrey Bosboom (firstinitiallastname.delete@this.firstnamelastname.com), November 4, 2022 5:18 pm

Room: Moderated Discussions

From a recent Agner Fog forum post:

The support for the new AVX512 instructions is quite good, and it includes many of the extra subsets of AVX512. Here, I have to correct a common misunderstanding. The Zen 4 does not execute a 512-bit vector instruction by using a 256-bit execution unit twice, but by using two 256-bit units simultaneously. It does not split a 512-bit instruction into two 256-bit micro-operations, like the Zen 1 that splits 256-bit instructions into two 128-bit micro-operations. The Zen 4 has four 256-bit execution units. Two of these units can do floating point addition, and the other two can do floating point multiplication. All four can do integer vector addition etc. This gives a maximum throughput for 512-bit vectors of one floating point vector multiplication and one floating point vector addition, or two integer vector additions, per clock cycle. This throughput is doubled for vectors of 256 bits or less. It is still advantageous to use 512-bit instructions if the throughput is limited by instruction decoding or micro-operation queues or code cache or something else. It is rare that execution unit throughput is the bottleneck.
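
To put numbers on the quoted figures, here is a rough toy model in Python. It is only a sketch of the arithmetic in the quote, not a description of the real scheduler: the names (UNITS, units_per_op, ops_per_cycle, bits_per_cycle) are made up for illustration, and it assumes a 512-bit instruction simply occupies two 256-bit units in the same cycle and that each operation class is considered in isolation.

```python
# Toy model of the figures in the quoted post (illustrative only).
# Assumption: a 512-bit op occupies two 256-bit units in the same cycle.
UNIT_WIDTH_BITS = 256

# Units able to execute each operation class, as stated in the quote:
# two do FP add, the other two do FP multiply, all four do integer add.
UNITS = {"fadd": 2, "fmul": 2, "iadd": 4}


def units_per_op(vector_bits):
    """Number of 256-bit units one instruction occupies (at least one)."""
    return max(1, -(-vector_bits // UNIT_WIDTH_BITS))  # ceiling division


def ops_per_cycle(op, vector_bits):
    """Peak instructions per cycle for one operation class in isolation."""
    return UNITS[op] // units_per_op(vector_bits)


def bits_per_cycle(op, vector_bits):
    """Peak vector bits processed per cycle for one operation class."""
    return ops_per_cycle(op, vector_bits) * vector_bits


if __name__ == "__main__":
    for op in ("fadd", "fmul", "iadd"):
        for width in (256, 512):
            print(op, width, ops_per_cycle(op, width), bits_per_cycle(op, width))
```

Under these assumptions the instruction rate halves when going from 256-bit to 512-bit vectors, but the bits processed per cycle stay the same, which is why the wider instructions only help when something other than the execution units is the limit.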

I understand how cracking a 2n-bit instruction into two n-bit instructions and executing them sequentially saves area compared to a full 2n-bit-wide unit. But what is the difference between one 2n-bit unit and two n-bit units that can execute a 2n-bit instruction at full rate when paired? Or from the other direction, why wouldn't a full 2n-bit unit also be designed to execute two n-bit instructions simultaneously?