By: Adrian (a.delete@this.acm.org), November 7, 2022 4:38 am
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on November 6, 2022 1:18 pm wrote:
> Chester (lamchester.delete@this.gmail.com) on November 5, 2022 3:24 pm wrote:
> >
> > That was not concluded. Rather it seems like a 512-bit op is fed into a single 256-bit pipe, and execution
> > starts over two cycles. The result for each half is ready
> > as fast as it would be for a plain 256-bit op, meaning
> > no latency increase.
>
> So the upper 256 bits are always staggered by one cycle? Kind of like how the original P4 double-pumped ALU
> worked and made most integer ops have a latency of just 0.5 cycles? (Except in this case it's not double-pumped,
> but you end up with an effective latency of 1 cycle even if the "whole" operation takes two).
>
> I guess for any throughput loads that's basically unnoticeable and perfectly fine (and AVX512
> is pretty much about throughput), but I'd assume you end up seeing the extra cycle of latency
> whenever you had an operation that collapsed the whole value (things like masked compares?).
>
> Or do I misunderstand?
>
> Linus
You understand correctly, but I have not yet seen any test results proving that this is indeed how AMD implemented it.
It is certainly one of the two most probable implementation choices. The alternative is that the second half of the operand is processed not in the next cycle in the same pipeline, but in the same cycle in the other pipeline of the same kind (the Zen 3/4 SIMD pipelines are grouped in pairs with the same properties).
A test that can expose the implementation method must be, as you say, one where sequential execution would cause an extra cycle of latency, i.e. it cannot be based on any of the operations that process the halves independently.
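To make the reasoning concrete, here is a toy model of a dependent chain of 512-bit operations under the two hypotheses. It is only a sketch under my own assumptions, not a description of the real hardware: the function name, its parameters, and the rule that the upper half issues exactly one cycle after the lower half in the "staggered" case are all hypothetical, and it ignores scheduling, bypass networks and everything else a real core does.

```python
def chain_latency(n_ops, half_latency, crosses_halves, staggered):
    """Cycles until both halves of the last result in a dependent chain are ready.

    Toy model only: every name and parameter here is hypothetical.
    """
    ready_lo = ready_hi = 0  # both input halves available at cycle 0
    for _ in range(n_ops):
        if crosses_halves:
            # Each output half needs the whole 512-bit input.
            dep_lo = dep_hi = max(ready_lo, ready_hi)
        else:
            # Lane-wise op: each half depends only on its own half.
            dep_lo, dep_hi = ready_lo, ready_hi
        issue_lo = dep_lo
        if staggered:
            # Same pipe: the upper half starts at least one cycle later.
            issue_hi = max(dep_hi, issue_lo + 1)
        else:
            # Paired pipes: both halves may start in the same cycle.
            issue_hi = dep_hi
        ready_lo = issue_lo + half_latency
        ready_hi = issue_hi + half_latency
    return max(ready_lo, ready_hi)


if __name__ == "__main__":
    for crosses in (False, True):
        for staggered in (True, False):
            total = chain_latency(10, 1, crosses, staggered)
            print(f"crosses_halves={crosses} staggered={staggered}: {total} cycles")
```

In this model a chain of 10 lane-wise ops with 1-cycle latency takes 11 cycles when staggered and 10 when paired, a constant one-cycle difference that would be lost in measurement noise. A chain of 10 ops that each need both halves takes 20 cycles when staggered and 10 when paired, which is why only the second kind of chain could distinguish the two implementations.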
Besides the Zen 3 pipelines, Zen 4 is said to have a new shuffle unit, which enables shuffles that cross the halves of a 512-bit operand. I do not know how this shuffle unit has been added to the existing pipelines, i.e. whether it is separate, so that an operation could be initiated on it simultaneously with the other pipelines, or, more likely, it is attached to only one of the existing pipelines, making that pipeline behave differently from the others.
So if a test used shuffles in an instruction sequence meant to expose an extra clock cycle of latency, there might be additional complications, and more complex testing would be needed to determine which implementation AMD chose for Zen 4.