By: Anon (no.delete@this.spam.com), November 5, 2022 4:43 pm
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on November 5, 2022 4:24 pm wrote:
> That was not concluded. Rather it seems like a 512-bit op is fed into a single 256-bit pipe, and execution
> starts over two cycles. The result for each half is ready as fast as it would be for a plain 256-bit op, meaning
> no latency increase. I'm pretty sure I state it's broken up *after* it enters the execution pipe.
>
> Data leading to conclusion: "Each AVX-512 instruction thus only consumes one entry in the relevant out-of-order
> execution buffers." -> explanation: I measured reordering capacity for 512-bit and 256-bit versions of the
> same operation and get the same scheduling capacity, and number of those that can be pending retirement
> (before two long latency loads on either end of the instruction sequence no longer overlap).
>
> Performance counters also show a single micro-op retired per instruction on 512-bit vectors, except
> for stores. And for 512-bit math ops, a single micro-op assigned to each FP execution pipe.
Thanks, so Agner Fog was wrong when he said:
"The Zen 4 does not execute a 512-bit vector instruction by using a 256-bit execution unit twice, but by using two 256-bit units simultaneously"?
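
To make the difference between the two descriptions concrete, here is a toy model in Python. It is only a sketch under simplifying assumptions, not a description of the real hardware: the function names, the mode labels, and the two-pipe, 3-cycle-latency numbers are all made up for illustration. It shows that both readings give the same 512-bit throughput and the same per-op latency in a dependency chain, so they would have to be told apart by something like how many micro-ops each pipe is assigned.

```python
# Toy model only: mode names and parameters are hypothetical, not measured values.
ONE_PIPE_TWO_CYCLES = "one_pipe_two_cycles"   # op enters one 256-bit pipe, halves start on consecutive cycles
TWO_PIPES_ONE_CYCLE = "two_pipes_one_cycle"   # op is spread across two 256-bit pipes in the same cycle


def throughput_512(pipes: int, mode: str) -> float:
    """Peak 512-bit ops per cycle for `pipes` identical 256-bit pipes."""
    if mode == ONE_PIPE_TWO_CYCLES:
        # Each op keeps one pipe busy for two cycles.
        return pipes / 2
    if mode == TWO_PIPES_ONE_CYCLE:
        # Each op keeps two pipes busy for one cycle.
        return float(pipes // 2)
    raise ValueError(f"unknown mode: {mode}")


def chain_ready_cycles(chain_len: int, latency: int, mode: str) -> tuple[int, int]:
    """Cycle at which the (lower, upper) half of the last op in a dependent chain is ready."""
    lower = chain_len * latency
    if mode == ONE_PIPE_TWO_CYCLES:
        # The upper half trails the lower half by one cycle throughout the chain,
        # so the offset is paid once and the per-op latency is unchanged.
        return lower, lower + 1
    if mode == TWO_PIPES_ONE_CYCLE:
        return lower, lower
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in (ONE_PIPE_TWO_CYCLES, TWO_PIPES_ONE_CYCLE):
        print(mode, throughput_512(2, mode), chain_ready_cycles(10, 3, mode))
```

In this model the two modes differ only by a constant one-cycle offset on the upper half, which a latency test over a long chain would not reveal.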