By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), November 6, 2022 2:18 pm
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on November 5, 2022 3:24 pm wrote:
>
> That was not concluded. Rather it seems like a 512-bit op is fed into a single 256-bit pipe, and execution
> starts over two cycles. The result for each half is ready as fast as it would be for a plain 256-bit op, meaning
> no latency increase.
So the upper 256 bits are always staggered by one cycle? Kind of like how the original P4 double-pumped ALU worked and made most integer ops have a latency of just 0.5 cycles? (Except in this case it's not double-pumped, but you end up with an effective latency of 1 cycle even if the "whole" operation takes two).
I guess for throughput-bound loads that's basically unnoticeable and perfectly fine (and AVX512 is pretty much all about throughput), but I'd assume you end up seeing the extra cycle of latency whenever you have an operation that collapses the whole value (things like masked compares?).
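To make that concrete, here's a rough sketch of the kind of serial chain I have in mind - just an illustration I'm making up, assuming AVX-512F intrinsics and a -mavx512f build, not anything actually measured on the hardware in question:

/* Toy dependency chain: each iteration's compare collapses the whole
 * 512-bit register into a 16-bit mask, and the following masked add
 * consumes that mask, so the collapse sits on the critical path.
 * Assumes AVX-512F; build with something like: gcc -O2 -mavx512f chain.c
 * Run it under perf (cycles / iterations) to see per-iteration latency. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    const __m512i one = _mm512_set1_epi32(1);
    __m512i v = _mm512_set1_epi32(2);      /* start > 1 so the mask is set */

    for (long i = 0; i < 100000000L; i++) {
        /* Collapse all 16 lanes into a mask register. */
        __mmask16 m = _mm512_cmpgt_epi32_mask(v, one);
        /* Masked add depends on both v and m, keeping the chain serial. */
        v = _mm512_mask_add_epi32(v, m, v, one);
    }

    int out[16];
    _mm512_storeu_si512(out, v);           /* keep the result live */
    printf("%d\n", out[0]);
    return 0;
}

If the upper half really is staggered, a chain like that (where the mask needs all 512 bits before the next op can start) is where I'd expect the extra cycle to show up, while independent back-to-back 512-bit ops wouldn't care.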
Or do I misunderstand?
Linus