By: Brendan (btrotter.delete@this.gmail.com), November 5, 2022 2:07 am
Room: Moderated Discussions
Jeffrey Bosboom (firstinitiallastname.delete@this.firstnamelastname.com) on November 4, 2022 6:18 pm wrote:
> From a recent Agner Fog forum post:
>
>
> The support for the new AVX512 instructions is quite good, and it includes many of the extra subsets of
> AVX512. Here, I have to correct a common misunderstanding. The Zen 4 does not execute a 512-bit vector
> instruction by using a 256-bit execution unit twice, but by using two 256-bit units simultaneously. It
> does not split a 512-bit instruction into two 256-bit micro-operations, like the Zen 1 that splits 256-bit
> instructions into two 128-bit micro-operations. The Zen 4 has four 256-bit execution units. Two of these
> units can do floating point addition, and the other two can do floating point multiplication. All four
> can do integer vector addition etc. This gives a maximum throughput for 512-bit vectors of one floating
> point vector multiplication and one floating point vector addition, or two integer vector additions, per
> clock cycle. This throughput is doubled for vectors of 256 bits or less. It is still advantageous to use
> 512-bit instructions if the throughput is limited by instruction decoding or micro-operation queues or
> code cache or something else. It is rare that execution unit throughput is the bottleneck.
>
> I understand how cracking a 2n-bit instruction into two n-bit instructions and executing them sequentially
> saves area compared to a full 2n-bit-wide unit. But what is the difference between one 2n-bit unit and
> two n-bit units that can execute a 2n-bit instruction at full rate when paired?
There are instructions that can't be nicely split into halves that execute separately, like permutes, shuffles, and conversions (e.g. 8 doubles -> 8 floats). For "two separate n-bit units" these cases become an ugly mess, while for "one 2n-bit unit" there's no problem.
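To make the distinction concrete, here is a rough toy model (not how any real hardware is wired). It treats a 512-bit vector as a list of 8 lanes and assumes each hypothetical 256-bit unit can only see its own 4-lane half. The function names are made up for illustration. An element-wise add splits cleanly, while a full-width permute can need every output lane to read from the other half.

```python
# Toy model: a "512-bit" vector is a list of 8 lanes, and each hypothetical
# "256-bit unit" is assumed to read and write only its own 4-lane half.

def split_halves(v):
    """Split an 8-lane vector into its low and high 4-lane halves."""
    return v[:4], v[4:]

def add_paired(a, b):
    """Element-wise add: each half is computed from its own inputs only."""
    a_lo, a_hi = split_halves(a)
    b_lo, b_hi = split_halves(b)
    lo = [x + y for x, y in zip(a_lo, b_lo)]  # unit 0 never looks at the high half
    hi = [x + y for x, y in zip(a_hi, b_hi)]  # unit 1 never looks at the low half
    return lo + hi

def permute_full(v, idx):
    """Full-width permute: any output lane may select any input lane."""
    return [v[i] for i in idx]

def cross_half_reads(idx):
    """Count output lanes whose source lane sits in the other half."""
    return sum(1 for out, src in enumerate(idx) if (out < 4) != (src < 4))
```

For example, reversing the vector with `idx = [7, 6, 5, 4, 3, 2, 1, 0]` gives `cross_half_reads(idx) == 8`: every output lane needs data from the half the "other" unit owns, so two paired units would need extra cross-wiring or an extra step, while a single full-width unit just does it.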
> Or from the other direction,
> why wouldn't a full 2n-bit unit also be designed to execute two n-bit instructions simultaneously?
I'd expect that to be much, much easier (if you can afford the extra die area).
- Brendan