By: anon (anon.delete@this.delete.com), November 7, 2022 12:34 pm
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on November 7, 2022 3:38 am wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on November 6, 2022 1:18 pm wrote:
> > Chester (lamchester.delete@this.gmail.com) on November 5, 2022 3:24 pm wrote:
> > >
> > > That was not concluded. Rather it seems like a 512-bit op is fed into a single 256-bit pipe, and execution
> > > starts over two cycles. The result for each half is ready
> > > as fast as it would be for a plain 256-bit op, meaning
> > > no latency increase.
> >
> > So the upper 256 bits are always staggered by one cycle? Kind of like how the original P4 double-pumped ALU
> > worked and made most integer ops have a latency of just 0.5
> > cycles? (Except in this case it's not double-pumped,
> > but you end up with an effective latency of 1 cycle even if the "whole" operation takes two).
> >
> > I guess for any throughput loads that's basically unnoticeable and perfectly fine (and AVX512
> > is pretty much about throughput), but I'd assume you end up seeing the extra cycle of latency
> > whenever you had an operation that collapsed the whole value (things like masked compares?).
> >
> > Or do I misunderstand?
> >
> > Linus
>
>
> You understand correctly, but I have not yet seen any test results
> that prove that this is indeed the AMD implementation.
>
>
> It certainly is the most probable implementation choice, together with the alternative where the second half
> of the operand is processed not in the next cycle in the same pipeline, but in the same cycle in the other
> pipeline of the same kind (the Zen 3/4 SIMD pipelines are grouped in pairs with the same properties).
>
>
> The test that can expose the implementation method must be, as you say, one
> where the sequential execution would cause an extra cycle of latency, i.e.
> not based on any of the operations that process the halves independently.
>
> Besides the Zen 3 pipelines, Zen 4 is said to have a new shuffle unit, which enables it to do
> shuffles where the halves of a 512-bit operand are crossed. I do not know how this shuffle unit
> has been added to the existing pipelines, i.e. whether it is separate and an operation could be
> initiated on it simultaneously with the other pipelines, or more likely, it is attached to only
> one of the existing pipelines, making that pipeline behave differently than the others.
>
> So if a test used shuffles in an instruction sequence intended to expose an
> extra clock cycle of latency, there might be additional complications, requiring
> more complex testing to elucidate which implementation AMD chose for Zen 4.
>
Chips & Cheese's IPC results for 2:1 and 1:1 interleaved 256-bit and 512-bit FMAs show that Zen 4 processes both halves on the same pipeline: https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-frontend-and-execution-engine/