Expanded question about design points

By: anon (anon.delete@this.delete.com), November 8, 2022 9:01 am
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on November 8, 2022 3:34 am wrote:
> anon (anon.delete@this.delete.com) on November 7, 2022 11:34 am wrote:
> > Adrian (a.delete@this.acm.org) on November 7, 2022 3:38 am wrote:
> > > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on November 6, 2022 1:18 pm wrote:
> > > > Chester (lamchester.delete@this.gmail.com) on November 5, 2022 3:24 pm wrote:
> > > > >
> > > > > That was not concluded. Rather it seems like a 512-bit op is fed into a single 256-bit pipe, and execution
> > > > > starts over two cycles. The result for each half is ready
> > > > > as fast as it would be for a plain 256-bit op, meaning
> > > > > no latency increase.
> > > >
> > > > So the upper 256 bits are always staggered by one cycle? Kind of like how the original P4 double-pumped ALU
> > > > worked and made most integer ops have a latency of just 0.5
> > > > cycles? (Except in this case it's not double-pumped,
> > > > but you end up with an effective latency of 1 cycle even if the "whole" operation takes two).
> > > >
> > > > I guess for any throughput loads that's basically unnoticeable and perfectly fine (and AVX512
> > > > is pretty much about throughput), but I'd assume you end up seeing the extra cycle of latency
> > > > whenever you had an operation that collapsed the whole value (things like masked compares?).
> > > >
> > > > Or do I misunderstand?
> > > >
> > > > Linus
> > >
> > >
> > > You understand correctly, but I have not yet seen any test results
> > > that prove that this is indeed the AMD implementation.
> > >
> > >
> > > It certainly is the most probable implementation choice,
> > > together with the alternative where the second half
> > > of the operand is processed not in the next cycle in the same pipeline, but in the same cycle in the other
> > > pipeline of the same kind (the Zen 3/4 SIMD pipelines are grouped in pairs with the same properties).
> > >
> > >
> > > The test that can expose the implementation method must be, as you say, one
> > > where the sequential execution would cause an extra cycle of latency, i.e.
> > > not based on any of the operations that process the halves independently.
> > >
> > > Besides the Zen 3 pipelines, Zen 4 is said to have a new shuffle unit, which enables it to do
> > > shuffles where the halves of a 512-bit operand are crossed. I do not know how this shuffle unit
> > > has been added to the existing pipelines, i.e. whether it is separate and an operation could be
> > > initiated on it simultaneously with the other pipelines, or more likely, it is attached to only
> > > one of the existing pipelines, making that pipeline behave differently than the others.
> > >
> > > So if a test tried to use shuffles in an instruction sequence meant to expose
> > > an extra clock cycle of latency, there might be additional complications, requiring
> > > more complex testing to determine which implementation AMD Zen 4 uses.
> > >
> >
> > Chips & Cheese's IPC results for 2:1 and 1:1 interleaved 256 and 512-bit FMAs
> > show that Zen 4 processes both halves on the same pipeline: https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-frontend-and-execution-engine/
>
>
> Thanks for pointing that out.
>
> I have already browsed through that article, but I was in a hurry and I have not read
> it carefully. At the first reading, I have noticed that the 512-bit operations are split
> after scheduling, not before that, but I have not looked at the included IPC table.
>
> The IPC table does indeed demonstrate that Zen 4 does something different from Tiger Lake, which
> just executes a 512-bit instruction simultaneously, using a pair of 256-bit pipelines.
>
> While these IPC results greatly increase the probability that when a 512-bit
> operation is split in Zen 4 it is executed in 2 consecutive clock cycles in
> the same pipeline, they still do not prove this beyond reasonable doubt.
>
> The same IPC values could be obtained if Zen 4 were able to reorder the 256-bit FMAs around
> the 512-bit FMA, in order to be able to execute simultaneously a pair of 256-bit FMAs.
>
> In order to be convinced about the sequential processing of the halves, I would have to
> see the machine instructions of the test code and see how such a reordering is avoided.
>
> Especially the 1.5 IPC value for 2 x 256-bit FMA + 1 x 512-bit FMA could be easily
> explained by alternating each clock cycle between computing one 512-bit operation and
> computing two 256-bit operations, even if the same IPC would be obtained by computing
> in each clock cycle one 256-bit operation and a half of a 512-bit operation.
>
>
> So for now, what is proven is only that the execution pipelines in Zen 4 are not switched between some
> persistent 512-bit and 256-bit modes, where the mode switching would require time, but during each clock
> cycle they can process either 256-bit operands or halves of 512-bit operands. Finer tests are needed
> to show whether the halves of 512-bit operands are processed simultaneously or sequentially.
>
>

The fact that 1:1 interleaving gives an IPC greater than 1 shows that Zen 4 does not force both halves to be processed in the same cycle (assuming Chips & Cheese's test uses a sufficiently long sequence).
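
To make the arithmetic concrete, here is a rough steady-state model of the three cases being argued about. It is only a sketch under my own assumptions: two 256-bit FMA pipes, fully independent FMAs, and no other bottleneck. The function names are made up for illustration, and none of the numbers are measurements.

```python
from fractions import Fraction

PIPES = 2  # assumption: two 256-bit FMA pipes


def ipc_sequential_halves(n256, n512, pipes=PIPES):
    """A 512-bit FMA occupies ONE pipe for two consecutive cycles."""
    pipe_cycles = n256 + 2 * n512
    return Fraction(n256 + n512) * pipes / pipe_cycles


def ipc_paired_reordered(n256, n512, pipes=PIPES):
    """A 512-bit FMA takes ALL pipes for one cycle; the scheduler is
    assumed free to reorder so that 256-bit FMAs always pair up."""
    cycles = Fraction(n256, pipes) + n512
    return Fraction(n256 + n512) / cycles


def ipc_paired_in_order(pattern, repeats=1000, pipes=PIPES):
    """A 512-bit FMA takes ALL pipes for one cycle; ops issue strictly in
    program order, so a lone 256-bit FMA can leave a pipe idle."""
    cycles = 0
    free = 0  # pipes still unused in the current cycle
    for width in pattern * repeats:
        need = 1 if width == 256 else pipes
        if free < need:  # does not fit in this cycle, open a new one
            cycles += 1
            free = pipes
        free -= need
    return Fraction(len(pattern) * repeats, cycles)


for name, pattern in [("1:1", [256, 512]), ("2:1", [256, 256, 512])]:
    n256, n512 = pattern.count(256), pattern.count(512)
    print(name,
          "sequential:", ipc_sequential_halves(n256, n512),
          "paired+reorder:", ipc_paired_reordered(n256, n512),
          "paired in-order:", ipc_paired_in_order(pattern))
```

In this model only the 1:1 mix tells the cases apart: same-cycle pairing without reordering is capped at 1 IPC, while sequential halves reach 4/3. Same-cycle pairing with perfect reordering also reaches 4/3, which is Adrian's objection, and the 2:1 mix gives 3/2 in all three cases.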