By: anon (anon.delete@this.delete.com), November 8, 2022 9:01 am
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on November 8, 2022 3:34 am wrote:
> anon (anon.delete@this.delete.com) on November 7, 2022 11:34 am wrote:
> > Adrian (a.delete@this.acm.org) on November 7, 2022 3:38 am wrote:
> > > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on November 6, 2022 1:18 pm wrote:
> > > > Chester (lamchester.delete@this.gmail.com) on November 5, 2022 3:24 pm wrote:
> > > > >
> > > > > That was not concluded. Rather it seems like a 512-bit op is fed into a single 256-bit pipe, and execution
> > > > > starts over two cycles. The result for each half is ready
> > > > > as fast as it would be for a plain 256-bit op, meaning
> > > > > no latency increase.
> > > >
> > > > So the upper 256 bits are always staggered by one cycle? Kind of like how the original P4 double-pumped ALU
> > > > worked and made most integer ops have a latency of just 0.5
> > > > cycles? (Except in this case it's not double-pumped,
> > > > but you end up with an effective latency of 1 cycle even if the "whole" operation takes two).
> > > >
> > > > I guess for any throughput loads that's basically unnoticeable and perfectly fine (and AVX512
> > > > is pretty much about throughput), but I'd assume you end up seeing the extra cycle of latency
> > > > whenever you had an operation that collapsed the whole value (things like masked compares?).
> > > >
> > > > Or do I misunderstand?
> > > >
> > > > Linus
> > >
> > >
> > > You understand correctly, but I have not yet seen any test results
> > > that prove that this is indeed the AMD implementation.
> > >
> > >
> > > It certainly is the most probable implementation choice,
> > > together with the alternative where the second half
> > > of the operand is processed not in the next cycle in the same pipeline, but in the same cycle in the other
> > > pipeline of the same kind (the Zen 3/4 SIMD pipelines are grouped in pairs with the same properties).
> > >
> > >
> > > The test that can expose the implementation method must be, as you say, one
> > > where the sequential execution would cause an extra cycle of latency, i.e.
> > > not based on any of the operations that process the halves independently.
> > >
> > > Besides the Zen 3 pipelines, Zen 4 is said to have a new shuffle unit, which enables it to do
> > > shuffles where the halves of a 512-bit operand are crossed. I do not know how this shuffle unit
> > > has been added to the existing pipelines, i.e. whether it is separate and an operation could be
> > > initiated on it simultaneously with the other pipelines, or more likely, it is attached to only
> > > one of the existing pipelines, making that pipeline behave differently than the others.
> > >
> > > So if a test used shuffles in an instruction sequence meant to expose
> > > an extra clock cycle of latency, there might be additional complications, requiring
> > > more complex testing to determine which implementation AMD Zen 4 uses.
> > >
> >
> > Chips & Cheese's IPC results for 2:1 and 1:1 interleaved 256 and 512-bit FMAs
> > show that Zen 4 processes both halves on the same pipeline: https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-frontend-and-execution-engine/
>
>
> Thanks for pointing that out.
>
> I had already browsed through that article, but I was in a hurry and did not read
> it carefully. On the first reading, I noticed that the 512-bit operations are split
> after scheduling, not before, but I did not look at the included IPC table.
>
> The IPC table does indeed demonstrate that Zen 4 does something different from Tiger Lake, which
> just executes a 512-bit instruction simultaneously, using a pair of 256-bit pipelines.
>
> While these IPC results greatly increase the probability that when a 512-bit
> operation is split in Zen 4 it is executed in 2 consecutive clock cycles in
> the same pipeline, they still do not prove this beyond reasonable doubt.
>
> The same IPC values could be obtained if Zen 4 were able to reorder the 256-bit FMAs around
> the 512-bit FMA, so that it could execute a pair of 256-bit FMAs simultaneously.
>
> In order to be convinced about the sequential processing of the halves, I would have to
> see the machine instructions of the test code and see how such a reordering is avoided.
>
> In particular, the 1.5 IPC value for 2 x 256-bit FMA + 1 x 512-bit FMA could easily be
> explained by alternating each clock cycle between computing one 512-bit operation and
> computing two 256-bit operations, even though the same IPC would also be obtained by computing
> in each clock cycle one 256-bit operation and one half of a 512-bit operation.
>
>
> So for now, what is proven is only that the execution pipelines in Zen 4 are not switched between some
> persistent 512-bit and 256-bit modes, where the mode switching would require time, but during each clock
> cycle they can process either 256-bit operands or halves of 512-bit operands. Finer tests are needed
> to show whether the halves of 512-bit operands are processed simultaneously or sequentially.
>
>
The fact that 1:1 interleaving gives an IPC greater than 1 shows that Zen 4 does not force both halves to be processed in the same cycle (assuming the Chips & Cheese test uses a sufficiently long sequence).
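
To make the slot counting in this thread concrete, here is a toy model in Python. It is only a sketch under assumptions that do not come from the article: exactly two 256-bit FMA pipes, fully independent FMAs, no front-end or register-file limits, and function names I made up for illustration. "Paired" means a 512-bit FMA takes both pipes in one cycle; "sequential" means it holds one pipe for two consecutive cycles. The "reordered" variant is the idealized version of the objection quoted above, with an unbounded reorder window.

```python
from itertools import cycle, islice

PIPES = 2  # assumption: two 256-bit FMA pipes


def stream(pattern, n_ops):
    """Repeat `pattern` (a list of 256/512 widths) until n_ops ops exist."""
    return list(islice(cycle(pattern), n_ops))


def ipc_paired_in_order(pattern, n_ops=6000):
    """512-bit op uses both pipes in one cycle; ops issue in program order."""
    ops = stream(pattern, n_ops)
    cycles = i = 0
    while i < n_ops:
        cycles += 1
        width = ops[i]
        i += 1
        # A 256-bit op leaves the other pipe free, but only another
        # 256-bit op can use it in this model.
        if width == 256 and i < n_ops and ops[i] == 256:
            i += 1
    return n_ops / cycles


def ipc_paired_reordered(pattern, n_ops=6000):
    """Same pipes, but 256-bit ops may be freely regrouped into pairs."""
    ops = stream(pattern, n_ops)
    n512 = ops.count(512)
    n256 = ops.count(256)
    cycles = n512 + (n256 + 1) // 2  # one cycle per 512, one per pair of 256s
    return n_ops / cycles


def ipc_sequential_in_order(pattern, n_ops=6000):
    """512-bit op holds one pipe for two cycles; ops issue in program order."""
    ops = stream(pattern, n_ops)
    free_at = [0] * PIPES  # first cycle at which each pipe accepts a new op
    now = i = 0
    while i < n_ops:
        for p in range(PIPES):
            if i < n_ops and free_at[p] <= now:
                free_at[p] = now + (2 if ops[i] == 512 else 1)
                i += 1
        now += 1
    return n_ops / max(free_at)


if __name__ == "__main__":
    mixes = {"1:1": [256, 512], "2:1": [256, 256, 512]}
    models = [ipc_paired_in_order, ipc_paired_reordered, ipc_sequential_in_order]
    for name, pattern in mixes.items():
        for model in models:
            print(f"{name}  {model.__name__:26s} IPC = {model(pattern):.2f}")
```

In this model the 2:1 mix comes out at about 1.5 IPC in all three cases, so it cannot tell them apart. The 1:1 mix is the discriminating one: paired in-order issue is stuck at 1.0, while sequential halves reach about 1.33. The reordered paired case also reaches about 1.33, so the conclusion depends on the test not letting the scheduler regroup the 256-bit FMAs.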