Expanded question about design points

By: anon (anon.delete@this.delete.com), November 7, 2022 12:34 pm
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on November 7, 2022 3:38 am wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on November 6, 2022 1:18 pm wrote:
> > Chester (lamchester.delete@this.gmail.com) on November 5, 2022 3:24 pm wrote:
> > >
> > > That was not concluded. Rather it seems like a 512-bit op is fed into a single 256-bit pipe, and execution
> > > starts over two cycles. The result for each half is ready
> > > as fast as it would be for a plain 256-bit op, meaning
> > > no latency increase.
> >
> > So the upper 256 bits are always staggered by one cycle? Kind of like how the original P4 double-pumped ALU
> > worked and made most integer ops have a latency of just 0.5
> > cycles? (Except in this case it's not double-pumped,
> > but you end up with an effective latency of 1 cycle even if the "whole" operation takes two).
> >
> > I guess for any throughput loads that's basically unnoticeable and perfectly fine (and AVX512
> > is pretty much about throughput), but I'd assume you end up seeing the extra cycle of latency
> > whenever you had an operation that collapsed the whole value (things like masked compares?).
> >
> > Or do I misunderstand?
> >
> > Linus
>
>
> You understand correctly, but I have not yet seen any test results
> that prove that this is indeed the AMD implementation.
>
>
> It certainly is the most probable implementation choice, together with the alternative where the second half
> of the operand is processed not in the next cycle in the same pipeline, but in the same cycle in the other
> pipeline of the same kind (the Zen 3/4 SIMD pipelines are grouped in pairs with the same properties).
>
>
> The test that can expose the implementation method must be, as you say, one
> where the sequential execution would cause an extra cycle of latency, i.e.
> not based on any of the operations that process the halves independently.
>
> Besides the Zen 3 pipelines, Zen 4 is said to have a new shuffle unit, which enables it to do
> shuffles where the halves of a 512-bit operand are crossed. I do not know how this shuffle unit
> has been added to the existing pipelines, i.e. whether it is separate and an operation could be
> initiated on it simultaneously with the other pipelines, or more likely, it is attached to only
> one of the existing pipelines, making that pipeline behave differently than the others.
>
> So if a test used shuffles in an instruction sequence meant to expose
> an extra clock cycle of latency, there might be additional complications, requiring
> more complex testing to determine which implementation AMD chose for Zen 4.

Chips & Cheese's IPC results for 256-bit and 512-bit FMAs interleaved at 2:1 and 1:1 ratios show that Zen 4 processes both halves of a 512-bit op on the same pipeline: https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-frontend-and-execution-engine/
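
For the latency side of the question raised in the quoted posts, here is a toy timing model in Python. It is only a sketch of the two hypotheses discussed above, not a description of the real hardware: the function name chain_finish, the stagger parameter, and the latency value of 4 cycles are all made up for illustration. stagger=1 models "upper half starts one cycle later on the same pipe", and stagger=0 models "both halves start in the same cycle on a pair of pipes".

```python
# Toy model of a dependent chain of 512-bit ops split into two 256-bit halves.
# Hypothetical sketch: names and numbers are illustrative assumptions,
# not measured Zen 4 behaviour.

def chain_finish(n_ops, latency, stagger, cross_half):
    """Cycle at which the full 512-bit result of the last op is ready.

    n_ops      -- length of the dependent chain
    latency    -- cycles from the start of a half to its result
    stagger    -- cycles the upper half starts after the lower half
                  (1 = same pipe on consecutive cycles, 0 = paired pipes)
    cross_half -- True if each op needs BOTH halves of its input
                  (e.g. a half-crossing shuffle); False for lane-wise ops
    """
    lo_ready = hi_ready = 0  # both input halves available at cycle 0
    for _ in range(n_ops):
        if cross_half:
            # Neither half can start before the whole input exists.
            in_lo = in_hi = max(lo_ready, hi_ready)
        else:
            # Each half depends only on the matching half of the input.
            in_lo, in_hi = lo_ready, hi_ready
        lo_start = in_lo
        hi_start = max(lo_start + stagger, in_hi)
        lo_ready = lo_start + latency
        hi_ready = hi_start + latency
    return max(lo_ready, hi_ready)


if __name__ == "__main__":
    for stagger in (1, 0):
        for cross in (False, True):
            cycles = chain_finish(10, 4, stagger, cross)
            print(f"stagger={stagger} cross_half={cross}: {cycles} cycles")
```

In this model a lane-wise chain costs only one extra cycle in total when the halves are staggered, while a chain where every op needs both halves pays one extra cycle per op. With stagger=0 the two cases are identical, which is why a test meant to tell the implementations apart has to use operations that do not process the halves independently.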