By: Chester (lamchester.delete@this.gmail.com), November 5, 2022 4:24 pm
Room: Moderated Discussions
> For Zen 4 everything seems to be 256 bits, what I want to know is how the testers concluded
> the same scheduler entry is feeding two units in what seem to be different schedulers.
That was not concluded. Rather, it seems a 512-bit op is fed into a single 256-bit pipe, with execution of the two halves starting on consecutive cycles. The result for each half is ready as fast as it would be for a plain 256-bit op, so there is no latency increase. I'm pretty sure I stated that it's broken up *after* it enters the execution pipe.
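To make the timing argument concrete, here is a toy model of it, not a description of the actual hardware. The function name `half_ready_cycles` and the 3-cycle latency are made up for illustration; the only things taken from the post are the 256-bit pipe width and the idea that the halves start on consecutive cycles.

```python
def half_ready_cycles(start_cycle, latency, width_bits, pipe_bits=256):
    """Toy model: cycle at which each pipe-width chunk of the result is ready.

    Assumes one chunk enters the pipe per cycle and each chunk takes
    `latency` cycles. Hypothetical helper, not measured behavior.
    """
    passes = -(-width_bits // pipe_bits)  # ceiling division
    return [start_cycle + i + latency for i in range(passes)]


# Hypothetical 3-cycle op, issued at cycle 0.
print(half_ready_cycles(0, 3, 256))  # [3]
print(half_ready_cycles(0, 3, 512))  # [3, 4]
```

In this model a dependent 512-bit op can start its low half as soon as the producer's low half is ready, and its high half one cycle later, so a dependency chain advances at the same rate as with 256-bit ops. The cost appears as pipe occupancy (two cycles per op), not latency.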
Data leading to that conclusion: "Each AVX-512 instruction thus only consumes one entry in the relevant out-of-order execution buffers." Explanation: I measured reordering capacity for 512-bit and 256-bit versions of the same operation and got the same result for both, in scheduling capacity and in the number of ops that can be pending retirement. Capacity here is the point where two long-latency loads on either end of the instruction sequence no longer overlap.
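The logic of that test can be sketched as a toy model. This is not the actual microbenchmark: the function names and the capacity of 64 entries are hypothetical, chosen only to show why an identical knee for 256-bit and 512-bit filler ops implies one entry per instruction.

```python
def loads_overlap(n_filler, capacity, entries_per_op):
    """Toy model: two long-latency loads with n_filler ops between them.

    The second load can only start while the first is outstanding if all
    filler ops fit in the structure under test.
    """
    return n_filler * entries_per_op <= capacity


def measured_capacity(capacity, entries_per_op):
    """Largest filler count for which the two loads still overlap."""
    n = 0
    while loads_overlap(n + 1, capacity, entries_per_op):
        n += 1
    return n


CAPACITY = 64  # hypothetical structure size, not a Zen 4 figure

# If a 512-bit op took one entry, the knee matches the 256-bit case.
print(measured_capacity(CAPACITY, entries_per_op=1))  # 64
# If it were split into two entries, the knee would halve.
print(measured_capacity(CAPACITY, entries_per_op=2))  # 32
```

Seeing the same knee for both vector widths corresponds to the first case.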
Performance counters also show a single micro-op retired per instruction on 512-bit vectors, except for stores. And for 512-bit math ops, a single micro-op assigned to each FP execution pipe.