Expanded question about design points

By: Adrian (a.delete@this.acm.org), November 5, 2022 3:00 am
Room: Moderated Discussions
Jeffrey Bosboom (firstinitiallastname.delete@this.firstnamelastname.com) on November 4, 2022 10:37 pm wrote:
> Mark Roulo (nothanks.delete@this.xxx.com) on November 4, 2022 8:34 pm wrote:
> > Is your question: Why would a CPU not allow two independent 256-bit vector instructions
> > to execute simultaneously in the top and bottom halves of a 512-bit vector?
>
> Sorry, my question is a bit confused because I am a bit confused. Let me expand a bit,
> and you can correct me where I'm wrong or explain how I'm not seeing this correctly.
>
> I see (at least) four design points here:
>
> 0) One 256-bit unit. Crack 512-bit instructions into two 256-bit uops, executed sequentially.
> Minimizes execution unit area and register file port count and width while supporting
> 512-bit ISA for decreased code size or software compatibility.
>
> 1) Two 256-bit units. Crack 512-bit instructions into two 256-bit uops, scheduled however they're
> mixed with 256-bit uops. Increases execution unit utilization by allowing [256, 512 first half]
> [512 second half, 256] pairing; requires more but narrower register file ports.
>
> 2) One 512-bit unit that can execute one 512-bit instruction or two 256-bit instructions. Allows
> a single 256-bit instruction to block a 512-bit instruction, leaving half the unit idle (or the scheduler
> stalls the 256-bit instruction until it can pair, increasing latency), but not cracking 512-bit means
> fewer uops through the pipeline and in the uop cache. Requires the same number of register file
> ports as 1) when executing 256-bit ops, but also needs wide ports for 512-bit ops.
>
> 3) One 512-bit unit that can execute one instruction regardless of width. Wastes half of the
> unit when executing 256-bit instructions. Requires fewer but wider register file ports.
>
> I can see why a designer would choose 0 for a low-perf, low-area design. Of
> the other three, 1 seems clearly better than 2 or 3. So my questions are:
>
> - Why would a designer choose 2 over 1?
> - Why would a designer choose 3 over 1 or 2?


First of all, for the AMD designers the set of design choices was much more restricted.

Since Zen 3 they already had execution units that could do four 256-bit operations per clock cycle. A few operations, e.g. FMA and multiplications, were restricted to only two of the four pipelines, so for those operations only two 256-bit operations could be done per cycle.


When implementing 512-bit operations on these four 256-bit pipelines, there are only two choices after splitting the 512-bit operation into two 256-bit operations.

Either the two 256-bit operations are executed in the same pipeline, in two consecutive clock cycles, or they are executed in two pipelines, in the same clock cycle.

The throughput will be identical, regardless of implementation.
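As a rough check of this claim, here is a minimal sketch that counts issue cycles for a stream of independent 512-bit operations under both variants. It assumes fully pipelined units, no dependencies between operations, and ignores the pipeline depth itself; the function names are mine, for illustration only.

```python
import math

PIPES = 4  # number of 256-bit pipelines, as described in the post


def cycles_sequential(n_ops, pipes=PIPES):
    """Each 512-bit op occupies ONE pipeline for TWO consecutive cycles."""
    return 2 * math.ceil(n_ops / pipes)


def cycles_paired(n_ops, pipes=PIPES):
    """Each 512-bit op occupies TWO pipelines for ONE cycle."""
    return math.ceil(n_ops / (pipes // 2))


if __name__ == "__main__":
    for n in (1, 2, 4, 1000):
        print(n, cycles_sequential(n), cycles_paired(n))
```

For long streams both variants converge on two 512-bit operations per cycle. For very short streams the sequential variant can finish one cycle later, which is the latency effect mentioned next.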

The latency of various instruction sequences could be affected by up to 1 clock cycle, but devising a test to expose this is not simple.

The choice between the two execution variants may be non-deterministic if the 512-bit operations are split early and the resulting 256-bit operations are then scheduled on whatever pipelines are available in each clock cycle.

If the splitting is done only immediately prior to execution, after scheduling, then using two consecutive cycles in the same pipeline needs a little more hardware for sequencing, but it does not need to wait for two pipelines to be simultaneously available. This solution seems better for code that mixes 256-bit and 512-bit instructions, because it allows all the pipelines to be fully used even in such cases.
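To make the mixed-width case concrete, here is a toy in-order issue model. It is only a sketch under my own assumptions, not a description of how any real scheduler works: operations are independent, issue is strictly in program order, a 256-bit operation takes the lowest-numbered free pipeline, and in the "paired" mode a 512-bit operation needs both pipelines of a fixed pair (0,1) or (2,3) free in the same cycle. The name `simulate` and the mode strings are hypothetical.

```python
def simulate(stream, mode, pipes=4):
    """Return the cycle at which the last op leaves its pipeline.

    stream: list of op widths (256 or 512), issued in order.
    mode:   "sequential" -> a 512-bit op holds one pipe for two cycles
            "paired"     -> a 512-bit op holds a fixed pair of pipes for one cycle
    """
    busy_until = [0] * pipes  # first cycle at which each pipe is free again
    cycle = 0
    i = 0
    while i < len(stream):
        # Issue as many ops as fit in this cycle, stopping at the first that cannot.
        while i < len(stream):
            free = [p for p in range(pipes) if busy_until[p] <= cycle]
            width = stream[i]
            if width == 256 or mode == "sequential":
                if not free:
                    break
                busy_until[free[0]] = cycle + (1 if width == 256 else 2)
            else:
                pair = next(((a, a + 1) for a in range(0, pipes, 2)
                             if a in free and a + 1 in free), None)
                if pair is None:
                    break
                for p in pair:
                    busy_until[p] = cycle + 1
            i += 1
        cycle += 1
    return max(busy_until)


if __name__ == "__main__":
    only_512 = [512] * 8
    mixed = [256, 512] * 8
    for name, s in (("only 512", only_512), ("mixed", mixed)):
        print(name, simulate(s, "sequential"), simulate(s, "paired"))
```

In this toy model the pure 512-bit stream takes the same number of cycles in both modes, while the alternating 256/512 stream finishes sooner in the sequential mode, because a lone 256-bit operation no longer leaves the other half of a pair idle. An out-of-order scheduler could hide part of that gap, so the numbers should not be read as predictions for real hardware.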

In the Intel optimization manuals, the 512-bit execution is described as the coupling of a pair of 256-bit pipelines into a 512-bit pipeline, and the 512-bit operations are scheduled on them, without splitting.

Zen 4 could also use the method described by Intel, by coupling the four 256-bit pipelines into two 512-bit pipelines onto which the 512-bit operations are scheduled, with the restriction that a few operations, e.g. FMA and MUL, can be executed in only one of the two pipelines.
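The peak rates implied by this coupling can be worked out with a few lines of arithmetic. The sketch below uses only the pipeline counts given in this post (four 256-bit pipelines, two of them FMA-capable); the choice of FP32 lanes and the helper names are my own assumptions for illustration.

```python
PIPES_256 = 4      # 256-bit pipelines, as stated in the post
FMA_PIPES_256 = 2  # pipelines able to do FMA/multiply, as stated in the post


def ops_per_cycle(width_bits, pipes_256):
    """Peak ops per cycle of the given width on a set of 256-bit pipelines."""
    pipes_needed = width_bits // 256  # 1 for 256-bit, 2 for 512-bit
    return pipes_256 // pipes_needed


def fp32_fma_flops_per_cycle(width_bits):
    """Peak FP32 FLOPs per cycle from FMA (2 FLOPs per lane per op)."""
    lanes = width_bits // 32
    return ops_per_cycle(width_bits, FMA_PIPES_256) * lanes * 2


if __name__ == "__main__":
    for w in (256, 512):
        print(w, ops_per_cycle(w, PIPES_256),
              ops_per_cycle(w, FMA_PIPES_256),
              fp32_fma_flops_per_cycle(w))
```

Under these assumptions the peak FMA FLOPs per cycle are the same for 256-bit and 512-bit code; what changes is the number of instructions needed to reach that peak.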

This method would be optimal, requiring the least hardware, for code that only uses 512-bit instructions without interleaving them with 256-bit operations (which would be a very stupid programming style).

In my opinion, the right way would be to ignore stupid programs that interleave operations of different sizes, couple the 256-bit pipelines into 512-bit pipelines, and schedule each 512-bit operation as a unit.

Nevertheless, it is possible that, for maximum versatility, AMD chose to split the 512-bit operations before scheduling, in which case the two halves could be executed either sequentially or simultaneously, with no difference in throughput. They might have done this, but I hope that they did not.












