Zen 4, AVX-512 support, 2 cycle execution time

By: Marcus (m.delete@this.bitsnbites.eu), August 30, 2022 10:34 am
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on August 30, 2022 12:45 am wrote:
> anonymous2 (anonymous2.delete@this.example.com) on August 29, 2022 5:08 pm wrote:
> > AVX-512 (ISA details murky) on Zen 4 but 2 cycles vs 1 on Intel so only 256b internally.
> >
> > Small win for those who want the ISA, but from a performance perspective limited value?
> >
>
>
> We do not know yet if there is any 2-cycle execution time for AVX-512.
>
> What we know is that Zen 4 has the same execution resources as Zen 3, and that most changes have
> been done only in the frontend. Some unspecified changes have been done also for load/store.
>
> For most wide operations, Zen 3 has either four 256-bit pipelines or two 256-bit pipelines.
>
> It is possible to implement a 512-bit operation using 2 cycles of the same pipeline.
> In that case, a 512-bit operation can be initiated every other cycle in the same pipeline,
> while a 256-bit operation can be initiated every cycle in the same pipeline.
>
> It is also possible to implement a 512-bit operation by using simultaneously two 256-bit pipelines.
> In that case, when executing 512-bit instructions, only either 1 or 2 instructions can be initiated
> per cycle, instead of either 2 or 4 instructions per cycle, as possible for 256-bit instructions.
>
>
> The latter variant seems a more likely implementation. That is also how Intel does this.
>
> So, I do not believe that for 512-bit instructions Zen 4 offers either two (for FMA or MUL) or four 512-bit
> pipelines with 2-cycle throughput. I believe that it offers either one (for FMA or MUL) or two (for simple
> AVX-512 instructions) 512-bit pipelines with 1-cycle throughput, like Intel in their non-server CPUs.
>

People seem to be fixating on 512 = 2 x 256, and therefore twice the performance..?
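
To put rough numbers on the two schemes in the quoted post, here is a minimal sketch. It assumes only the pipeline counts given there (two or four 256-bit pipes) and ignores scheduling, load/store and port conflicts. The function names are made up for illustration.

```python
from fractions import Fraction


def ops_per_cycle_double_pumped(num_256b_pipes):
    """512-bit ops per cycle if each op occupies ONE 256-bit pipe for 2 cycles."""
    return Fraction(num_256b_pipes, 2)


def ops_per_cycle_paired(num_256b_pipes):
    """512-bit ops per cycle if each op occupies TWO 256-bit pipes for 1 cycle."""
    return Fraction(num_256b_pipes // 2, 1)


if __name__ == "__main__":
    for pipes in (2, 4):
        pumped = ops_per_cycle_double_pumped(pipes)
        paired = ops_per_cycle_paired(pipes)
        # Peak data width processed per cycle is bounded by the pipes either way.
        print(f"{pipes} x 256b pipes: double-pumped={pumped} op/cycle, "
              f"paired={paired} op/cycle, peak bits/cycle={pipes * 256}")
```

Under these assumptions both schemes reach the same peak 512-bit issue rate, so peak throughput alone does not tell them apart.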

There are several advantages to splitting a wide vector operation into two (or more) sequential steps. First, you can hide or reduce data dependency latencies, which means that you do not have to resort to the ridiculous unrolling seen in some SIMD code, and you do not need as wide an OoO window; that in turn translates to lower I$ pressure and simpler hardware. Second, you can "pipeline" or "chain" instructions, i.e. start the next instruction as soon as the first 256b half of the previous one is ready. And of course you do not *need* 512b wide work packages to process 512 bits in parallel: you can just as well ensure that more 256b operations can start concurrently.
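
The latency-hiding argument can be sketched with a toy model. This is not a description of Zen 4. It assumes a lane-wise operation (no cross-lane dependencies), a hypothetical result latency of 4 cycles, and that a dependent instruction may start on the first 256b half as soon as that half is ready. All names are illustrative.

```python
import math


def chains_to_saturate(latency, occupancy):
    """Independent dependency chains (roughly: unroll factor) needed to keep
    one pipe issuing every cycle, when each op occupies it for `occupancy`
    cycles and its result is available after `latency` cycles."""
    return math.ceil(latency / occupancy)


def chain_cycles_full_width(n_ops, latency):
    """Cycles for n dependent 512-bit ops on a full-width 512b pipe."""
    return n_ops * latency


def chain_cycles_double_pumped(n_ops, latency):
    """Cycles for n dependent 512-bit ops on ONE 256b pipe with chaining.

    The low half of op i issues at cycle i*latency, the high half one cycle
    later, so the last high half completes at n*latency + 1."""
    if latency < 2:
        raise ValueError("model assumes latency >= 2 so both halves fit")
    return n_ops * latency + 1


if __name__ == "__main__":
    latency = 4  # assumed latency, for illustration only
    print("chains needed, full-width pipe:", chains_to_saturate(latency, 1))
    print("chains needed, double-pumped pipe:", chains_to_saturate(latency, 2))
    print("10 dependent ops, full-width:", chain_cycles_full_width(10, latency))
    print("10 dependent ops, double-pumped:", chain_cycles_double_pumped(10, latency))
```

In this model a single dependent chain takes almost the same time on a 256b pipe as on a 512b one, and the double-pumped pipe needs only half as many independent chains to stay busy, which is the point about unrolling.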

In the end, the most important thing is to get the ISA advantages of AVX-512 (masking etc.), and you *will* see a performance uplift, albeit not necessarily the 50-60% (?) increase seen on some Intel parts.