By: Brendan (btrotter.delete@this.gmail.com), November 5, 2022 2:07 am

Room: Moderated Discussions

Jeffrey Bosboom (firstinitiallastname.delete@this.firstnamelastname.com) on November 4, 2022 6:18 pm wrote:

> From a recent Agner Fog forum post:

>

>

> The support for the new AVX512 instructions is quite good, and it includes many of the extra subsets of

> AVX512. Here, I have to correct a common misunderstanding. The Zen 4 does not execute a 512-bit vector

> instruction by using a 256-bit execution unit twice, but by using two 256-bit units simultaneously. It

> does not split a 512-bit instruction into two 256-bit micro-operations, like the Zen 1 that splits 256-bit

> instructions into two 128-bit micro-operations. The Zen 4 has four 256-bit execution units. Two of these

> units can do floating point addition, and the other two can do floating point multiplication. All four

> can do integer vector addition etc. This gives a maximum throughput for 512-bit vectors of one floating

> point vector multiplication and one floating point vector addition, or two integer vector additions, per

> clock cycle. This throughput is doubled for vectors of 256 bits or less. It is still advantageous to use

> 512-bit instructions if the throughput is limited by instruction decoding or micro-operation queues or

> code cache or something else. It is rare that execution unit throughput is the bottleneck.

>

> I understand how cracking a 2n-bit instruction into two n-bit instructions and executing them sequentially

> saves area compared to a full 2n-bit-wide unit. But what is the difference between one 2n-bit unit and

> two n-bit units that can execute a 2n-bit instruction at full rate when paired?

There are instructions that can't be cleanly split into halves that execute separately, such as permutes, shuffles and conversions (e.g. 8 doubles -> 8 floats). With two separate n-bit units these cases become an ugly mess; with one 2n-bit unit there's no problem.
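To make that concrete, here's a rough toy model (not how any real core is wired). It treats a 2n-bit vector as a list of 8 elements and assumes each n-bit unit can only read its own 4-element half. The function names are made up for illustration. An element-wise add splits cleanly, while a full-width permute may need source elements from the other half.

```python
# Toy model only: a "2n-bit vector" is a list of 8 elements, and each
# hypothetical n-bit unit is assumed to see just one 4-element half.

def add_full(a, b):
    """Element-wise add as one 2n-bit unit would do it."""
    return [x + y for x, y in zip(a, b)]

def add_split(a, b):
    """Same add done as two independent half-width operations."""
    h = len(a) // 2
    lo = [x + y for x, y in zip(a[:h], b[:h])]
    hi = [x + y for x, y in zip(a[h:], b[h:])]
    return lo + hi

def permute_full(src, idx):
    """Full-width permute: any output element may come from any source element."""
    return [src[i] for i in idx]

def cross_half_reads(idx):
    """Count output elements whose source lives in the *other* half."""
    h = len(idx) // 2
    return sum(1 for out_pos, src_pos in enumerate(idx)
               if (out_pos < h) != (src_pos < h))

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10, 20, 30, 40, 50, 60, 70, 80]
reverse = [7, 6, 5, 4, 3, 2, 1, 0]

print(add_split(a, b) == add_full(a, b))  # halves never need each other's data
print(permute_full(a, reverse))           # a reversed
print(cross_half_reads(reverse))          # every output needs the other half
```

In this model the add never needs data to cross between the halves, but the reversing permute needs it for every element, so two paired units would need some extra path between them (or a separate full-width shuffle unit) to handle it.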

> Or from the other direction,

> why wouldn't a full 2n-bit unit also be designed to execute two n-bit instructions simultaneously?

I'd expect that to be much, much easier (if you can afford the extra die area).

- Brendan