By: Chester (lamchester.delete@this.gmail.com), November 8, 2022 8:29 am
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on November 8, 2022 3:34 am wrote:
> anon (anon.delete@this.delete.com) on November 7, 2022 11:34 am wrote:
> > Adrian (a.delete@this.acm.org) on November 7, 2022 3:38 am wrote:
> > > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on November 6, 2022 1:18 pm wrote:
> > > > Chester (lamchester.delete@this.gmail.com) on November 5, 2022 3:24 pm wrote:
> > > > >
> > > > > That was not concluded. Rather it seems like a 512-bit op is fed into a single 256-bit pipe, and execution
> > > > > starts over two cycles. The result for each half is ready
> > > > > as fast as it would be for a plain 256-bit op, meaning
> > > > > no latency increase.
> > > >
> > > > So the upper 256 bits are always staggered by one cycle? Kind of like how the original P4 double-pumped ALU
> > > > worked and made most integer ops have a latency of just 0.5
> > > > cycles? (Except in this case it's not double-pumped,
> > > > but you end up with an effective latency of 1 cycle even if the "whole" operation takes two).
> > > >
> > > > I guess for any throughput loads that's basically unnoticeable and perfectly fine (and AVX512
> > > > is pretty much about throughput), but I'd assume you end up seeing the extra cycle of latency
> > > > whenever you had an operation that collapsed the whole value (things like masked compares?).
> > > >
> > > > Or do I misunderstand?
> > > >
> > > > Linus
Yes, that's my interpretation, given data from performance counters, no measured increase in latency (I didn't test anything that would collapse the whole value), and no throughput loss when mixing 256-bit and 512-bit instructions.
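To make the "no latency increase" claim concrete, here is a toy cycle model of the staggered-issue hypothesis. This is my own sketch with illustrative numbers, not measured Zen 4 latencies or a documented AMD design: each 512-bit op issues its lower 256-bit half in cycle s and its upper half in cycle s+1, each half finishing after the normal 256-bit latency. For ops whose halves are independent (FMA, add, etc.), a dependent op's lower half only waits on the producer's lower half, so a dependency chain costs the plain 256-bit latency per op.

```python
def chain_cycles(n_ops, lat):
    """Cycle at which the last op of a 512-bit dependency chain completes,
    assuming each op's halves issue in back-to-back cycles on one pipe."""
    s = 0                      # issue cycle of the first op's lower half
    for _ in range(n_ops - 1):
        s += lat               # next lower half waits only on producer's lower half
    return s + 1 + lat         # last op's upper half (issued at s+1) finishes here

lat = 4                        # assumed 256-bit FMA latency, for illustration only
n = 100
print(chain_cycles(n, lat) / n)  # 4.01 cycles/op, ~ the plain 256-bit latency
```

A latency test over a long enough chain would thus see essentially the 256-bit figure, consistent with what was measured.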
> > > You understand correctly, but I have not yet seen any test results
> > > that prove that this is indeed the AMD implementation.
> > >
> > >
> > > It certainly is the most probable implementation choice,
> > > together with the alternative where the second half
> > > of the operand is processed not in the next cycle in the same pipeline, but in the same cycle in the other
> > > pipeline of the same kind (the Zen 3/4 SIMD pipelines are grouped in pairs with the same properties).
> > >
> > >
> > > The test that can expose the implementation method must be, as you say, one
> > > where the sequential execution would cause an extra cycle of latency, i.e.
> > > not based on any of the operations that process the halves independently.
Yeah, that would be interesting to test, though the system isn't up anymore.
"second half of the operand is processed .... in the same cycle in the other pipeline of the same kind" - that explanation is extremely unlikely, because when 512-bit math instructions are used, the number of micro-ops bound to the FP pipes equals the number of micro-ops retired, which equals the number of instructions retired. One scheduler would need to be able to cross-issue a single micro-op to the other scheduler's ports, which I think would be extremely messy.
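The counter argument above can be put in simple accounting terms. This is my framing of the two hypotheses, not AMD documentation: if each half went to a different pipe of the pair, a 512-bit instruction would need a pipe binding on both pipes, so pipe-bound micro-ops would be double the retired micro-ops; the counters instead show a 1:1:1 ratio.

```python
def counts(n_insts, cross_pipe):
    """(pipe-bound micro-ops, retired micro-ops) for n 512-bit math
    instructions, assuming each retires as a single micro-op."""
    uops_retired = n_insts
    pipe_bound = n_insts * (2 if cross_pipe else 1)
    return pipe_bound, uops_retired

print(counts(1000, cross_pipe=False))  # (1000, 1000): the 1:1 ratio observed
print(counts(1000, cross_pipe=True))   # (2000, 1000): not what counters show
```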
> > > Besides the Zen 3 pipelines, Zen 4 is said to have a new shuffle unit, which enables it to do
> > > shuffles where the halves of a 512-bit operand are crossed. I do not know how this shuffle unit
> > > has been added to the existing pipelines, i.e. whether it is separate and an operation could be
> > > initiated on it simultaneously with the other pipelines, or more likely, it is attached to only
> > > one of the existing pipelines, making that pipeline behave differently than the others.
> > >
> > > So if a test tried to use shuffles in an instruction sequence meant to expose
> > > an extra clock cycle of latency, there might be additional complications,
> > > requiring more complex testing to elucidate which implementation AMD chose for Zen 4.
> >
> > Chips & Cheese's IPC results for 2:1 and 1:1 interleaved 256 and 512-bit FMAs
> > show that Zen 4 processes both halves on the same pipeline: https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-frontend-and-execution-engine/
>
>
> Thanks for pointing that out.
>
> I had already browsed through that article, but I was in a hurry and did not read
> it carefully. On that first reading I noticed that the 512-bit operations are split
> after scheduling, not before, but I did not look at the included IPC table.
>
> The IPC table does indeed demonstrate that Zen 4 does something different from Tiger Lake, which
> just executes a 512-bit instruction simultaneously, using a pair of 256-bit pipelines.
>
> While these IPC results greatly increase the probability that when a 512-bit
> operation is split in Zen 4 it is executed in 2 consecutive clock cycles in
> the same pipeline, they still do not prove this beyond reasonable doubt.
>
> The same IPC values could be obtained if Zen 4 were able to reorder the 256-bit FMAs around
> the 512-bit FMA, so that it could execute a pair of 256-bit FMAs simultaneously.
Yeah, that's exactly why the test used a 2:1 ratio.
Same pipeline: that follows from the micro-op counts mentioned above.
>
> In order to be convinced about the sequential processing of the halves, I would have to
> see the machine instructions of the test code and see how such a reordering is avoided.
>
> Especially the 1.5 IPC value for 2 x 256-bit FMA + 1 x 512-bit FMA could be easily
> explained by alternating each clock cycle between computing one 512-bit operation and
> computing two 256-bit operations, even if the same IPC would be obtained by computing
> in each clock cycle one 256-bit operation and a half of a 512-bit operation.
>
>
> So for now, what is proven is only that the execution pipelines in Zen 4 are not switched between some
> persistent 512-bit and 256-bit modes, where the mode switching would require time, but during each clock
> cycle they can process either 256-bit operands or halves of 512-bit operands. Finer tests are needed
> to show whether the halves of 512-bit operands are processed simultaneously or sequentially.
Yes, it behaves similarly to CNS, which does split 512-bit instructions into two micro-ops, except that Zen 4 retains full scheduler and register file capacity with 512-bit instructions, while you get half of it on CNS (excluding stores, which are broken into two micro-ops on Zen 4).
I don't think switching necessarily requires time. Another possibility is that the FMA units can't have one 512-bit instruction and two 256-bit instructions in flight at the same time. But testing that would require hardware I personally own and time I don't have.
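A back-of-the-envelope schedule model shows why the IPC table alone can't separate these two designs. This is my own slot accounting, assuming two 256-bit FMA pipes and the 2:1 test mix (two 256-bit FMAs per 512-bit FMA); both schedules land on the same cycle count and hence the same 1.5 IPC.

```python
def cycles_same_pipe(groups):
    """Each 512-bit FMA occupies one pipe for two consecutive cycles.
    Per group: 2 x 256-bit ops + 2 halves = 4 slots across 2 pipes."""
    slots = groups * (2 + 2)
    return slots / 2             # two 256-bit-wide slots per cycle

def cycles_alternating(groups):
    """Each cycle runs either one full 512-bit op (both units) or two
    256-bit ops, never a mix: two cycles per group either way."""
    return groups * 2

groups = 100
insts = groups * 3               # 2 x 256-bit + 1 x 512-bit per group
print(insts / cycles_same_pipe(groups))    # 1.5 IPC
print(insts / cycles_alternating(groups))  # 1.5 IPC, indistinguishable
```

Separating the hypotheses would therefore need a latency-sensitive test on real hardware, not just a throughput measurement.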