By: Adrian (a.delete@this.acm.org), November 8, 2022 4:34 am
Room: Moderated Discussions
anon (anon.delete@this.delete.com) on November 7, 2022 11:34 am wrote:
> Adrian (a.delete@this.acm.org) on November 7, 2022 3:38 am wrote:
> > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on November 6, 2022 1:18 pm wrote:
> > > Chester (lamchester.delete@this.gmail.com) on November 5, 2022 3:24 pm wrote:
> > > >
> > > > That was not concluded. Rather it seems like a 512-bit op is fed into a single 256-bit pipe, and execution
> > > > starts over two cycles. The result for each half is ready
> > > > as fast as it would be for a plain 256-bit op, meaning
> > > > no latency increase.
> > >
> > > So the upper 256 bits are always staggered by one cycle? Kind of like how the original P4 double-pumped ALU
> > > worked and made most integer ops have a latency of just 0.5
> > > cycles? (Except in this case it's not double-pumped,
> > > but you end up with an effective latency of 1 cycle even if the "whole" operation takes two).
> > >
> > > I guess for any throughput loads that's basically unnoticeable and perfectly fine (and AVX512
> > > is pretty much about throughput), but I'd assume you end up seeing the extra cycle of latency
> > > whenever you had an operation that collapsed the whole value (things like masked compares?).
> > >
> > > Or do I misunderstand?
> > >
> > > Linus
> >
> >
> > You understand correctly, but I have not seen yet any test results
> > that prove that this is indeed the AMD implementation.
> >
> >
> > It certainly is the most probable implementation choice,
> > together with the alternative where the second half
> > of the operand is processed not in the next cycle in the same pipeline, but in the same cycle in the other
> > pipeline of the same kind (the Zen 3/4 SIMD pipelines are grouped in pairs with the same properties).
> >
> >
> > The test that can expose the implementation method must be, as you say, one
> > where the sequential execution would cause an extra cycle of latency, i.e.
> > not based on any of the operations that process the halves independently.
> >
> > Besides the Zen 3 pipelines, Zen 4 is said to have a new shuffle unit, which enables it to do
> > shuffles where the halves of a 512-bit operand are crossed. I do not know how this shuffle unit
> > has been added to the existing pipelines, i.e. whether it is separate and an operation could be
> > initiated on it simultaneously with the other pipelines, or more likely, it is attached to only
> > one of the existing pipelines, making that pipeline behave differently than the others.
> >
> > So if a test would try to use shuffles for an instruction sequence trying to expose
> > an extra clock cycle of latency, there might be additional complications, requiring
> > a more complex testing for elucidating which is the AMD Zen 4 implementation.
> >
>
> Chips & Cheese's IPC results for 2:1 and 1:1 interleaved 256 and 512-bit FMAs
> show that Zen 4 processes both halves on the same pipeline: https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-frontend-and-execution-engine/
Thanks for pointing that out.
I had already browsed through that article, but I was in a hurry and did not read it carefully. On a first reading I noticed that the 512-bit operations are split after scheduling, not before, but I did not look at the included IPC table.
The IPC table does indeed demonstrate that Zen 4 does something different from Tiger Lake, which simply executes a 512-bit instruction on a pair of 256-bit pipelines simultaneously.
While these IPC results greatly increase the probability that a 512-bit operation split in Zen 4 is executed in 2 consecutive clock cycles in the same pipeline, they still do not prove this beyond reasonable doubt.
The same IPC values could be obtained if Zen 4 were able to reorder the 256-bit FMAs around the 512-bit FMA, so that a pair of 256-bit FMAs could execute simultaneously.
To be convinced of the sequential processing of the halves, I would have to see the machine instructions of the test code and verify how such a reordering is avoided.
Especially the 1.5 IPC value for 2 x 256-bit FMA + 1 x 512-bit FMA could easily be explained by alternating each clock cycle between computing one 512-bit operation and computing two 256-bit operations, even though the same IPC would also be obtained by computing in each clock cycle one 256-bit operation and one half of a 512-bit operation.
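To make the ambiguity concrete, here is a toy scheduling model of the two hypotheses for that 2:1 interleaved test. All the numbers (two FMA pipes, one uop per pipe per cycle, a 512-bit op splitting into two halves) are illustrative assumptions for the sketch, not measured Zen 4 parameters:

```python
# Toy model: two hypotheses for how a repeated pattern of
# 2 x 256-bit FMA + 1 x 512-bit FMA could be scheduled on two FMA pipes.
# All parameters are illustrative assumptions, not measured values.

def ipc_sequential_halves(groups):
    # Hypothesis 1: each 512-bit FMA is split into two 256-bit halves
    # issued on the SAME pipe in consecutive cycles; each 256-bit FMA
    # is a single uop.
    uops = groups * (1 + 1 + 2)      # two 256-bit FMAs + one split 512-bit FMA
    cycles = uops / 2                # two pipes, each accepts one uop per cycle
    return groups * 3 / cycles       # instructions retired per cycle

def ipc_alternating(groups):
    # Hypothesis 2: one cycle runs the 512-bit op on BOTH pipes at once,
    # the next cycle runs the two 256-bit FMAs side by side.
    cycles = groups * 2
    return groups * 3 / cycles

print(ipc_sequential_halves(1000))   # 1.5
print(ipc_alternating(1000))         # 1.5 -- identical, so IPC alone cannot decide
```

Both hypotheses yield exactly 1.5 IPC, which is why the throughput measurement by itself cannot distinguish them.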
So for now, all that is proven is that the execution pipelines in Zen 4 are not switched between persistent 512-bit and 256-bit modes, where the mode switching would cost time; during each clock cycle they can process either 256-bit operands or halves of 512-bit operands. Finer tests are needed to show whether the halves of 512-bit operands are processed simultaneously or sequentially.
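One such finer test would be a dependent chain of 512-bit operations in which each operation's low half consumes the previous operation's high half (a lane-crossing dependency, e.g. through a shuffle). The sketch below models the expected cycle counts under the two hypotheses; the base latency and chain length are assumed placeholders:

```python
# Sketch of a latency discriminator: a dependent chain of 512-bit ops
# where every link crosses from the high half of one result to the
# low half of the next. Latency values are assumed placeholders.

BASE_LAT = 4      # assumed latency of one 512-bit op when halves finish together
CHAIN = 100       # number of links in the dependent chain

def chain_cycles(staggered_halves):
    # If the halves are processed sequentially in the same pipeline,
    # the high half of each result is ready one cycle after the low half,
    # so every lane-crossing link pays one extra cycle of latency.
    per_link = BASE_LAT + (1 if staggered_halves else 0)
    return CHAIN * per_link

print(chain_cycles(False))   # 400 cycles: halves processed simultaneously
print(chain_cycles(True))    # 500 cycles: halves staggered by one cycle
```

Measuring such a chain against a baseline chain with no lane-crossing dependency would expose the extra cycle per link, if it exists, independently of any reordering by the scheduler.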