Expanded question about design points

By: --- (---.delete@this.redheron.com), November 9, 2022 11:41 am
Room: Moderated Discussions
Brett (ggtgp.delete@this.yahoo.com) on November 8, 2022 11:33 am wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on November 8, 2022 10:35 am wrote:
> > Adrian (a.delete@this.acm.org) on November 8, 2022 8:53 am wrote:
> > >
> > > I am skeptical that AMD has chosen the variant with sequential processing of the halves, because
> > > that creates problems for the few instructions that need to access both halves.
> >
> > So I would actually love to hear that that is what AMD does,
> > because I think it's conceptually a lovely model.
> >
> > It's literally the original traditional vector model, where you treat vectors not as one thing, but as a
> > sequence of things. That's a model that actually scales, in that it doesn't penalize the smaller case.
> >
> > I'm certainly on record as not being a huge fan of AVX512, but any implementation
> > that makes the effort to also scale down is a good implementation in my book.
> >
> > The "do things sequentially" has almost no cost in the common case, when the halves actually
> > contain independent data. Sure, it doesn't do things like full-width cross-lane operations
> > very naturally, and we've seen how some people on this forum absolutely love the shuffle
> > operations, but let's be honest: shuffle fundamentally does not scale.
> >
> > I think vector people often forget how special they are. And I mean "your mom told you you were special"
> > kind of special. Non-vector code still - and probably forever - dominates hugely, and even in the vector
> > world, AVX512 is certainly not the big dog and is almost entirely a "look, ma, benchmarks" thing.
> >
> > > Only for store operations we know for sure that the halves are processed sequentially,
> > > due to the 256-bit path to the L1 cache, and there the 512-bit operations are
> > > split early into 256-bit operations, not at the execution time.
> >
> > Sure, but that may just be a random internal design decision, where (a) the memory
> > pipeline is designed and optimized for smaller units (which is good - because
> > those are the common case by far), and (b) is a clearly separate unit.
> >
> > IOW, it's entirely possible that the vector units basically do the same sequential thing, but because
> > for them it's internal, it's not nearly as visible in other micro-architectural details.
> >
> > The memory unit choice will be very visible in things like
> > store buffer sizing experiments, number of outstanding
> > cache accesses etc etc. And it's probably even visible as separate uops (since it's now a cross-unit thing
> > and thus presumably tracked that way), it will stand out in all the basic performance counters too.
> >
> > In contrast, some sequential operation inside the vector units is much more subtle, particularly since
> > in most cases you still end up with that effective single-cycle latency if you just forward the low 256
> > bits between units. So it's basically much less visible just because it's done at a more local level.
> >
> > It's basically not really different from some operations being single-cycle and others being multiple
> > cycles, and that's already something that the vector unit has to deal with anyway. The only new thing
> > is how part of the data comes out a cycle earlier, and even that isn't really unheard of.
> >
> > We've seen those kinds of single-cycle skews all over the place before, to the point where people just
> > take them for granted and don't even mention them (eg memory units often have the "store address vs store
> > data" skew, regular integer ALU's often have a "result data vs flags data", and many pipelines have things
> > like "I can forward this in one cycle within a cluster, but need two cycles between clusters").
> >
> > So once it's a "within this unit", you'll seldom even see a lot of discussion about
> > how some cases may need a cycle or two, because it's usually not visible in the common
> > case, since it's all been designed to not show the hiccups in that case.
> >
> > And I'd much rather see a sequential model with the low bit results available
> > early than something where units are tied together (or worse yet: full-width
> > vector units that don't do two half-width operations in parallel at all).
> >
> > Netburst showed that it could work even for integer results - the P4 had a lot of problems,
> > but the double-pumped ALU was interesting, and was not the primary pain point (it didn't help
> > all the other design problems of course, and did make for scheduling issues, so I'm not claiming
> > it worked flawlessly, I'm just saying that the real problems were elsewhere).
> >
> > But let's wait for more hard numbers to see what AMD actually did.
> >
> > Linus
> With a MAC instruction you need three or four read ports and thus are borrowing read ports
> from the previous cycle of the ALU. This does not force you to crack the instruction.
> So you can just have one instruction that takes two cycles and reads four registers
> 256 bits wide to produce two 256 bit results for the 512 needed. You have 256 bit reads
> and writes, and all of this will interleave just fine with 256 bit operations.
> And sequential operations can half overlap, so no stalls. Low 256 half always first so the next
> instruction is fine and does not care that the high 256 half comes next cycle as needed.
> 20/20 hindsight. ;)
> I have been lobbying for instructions with four reads and two writes for decade(s); opcode
> merging of this type is one of the few remaining ways to improve performance. Shift and add, two combined
> adds/subtracts, etc. This saves a tracking slot and read and write ports, versus two independent
> operations. Of course I want to do this in one cycle, but the idea is the same.
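
The sequential-halves timing Linus describes above can be sketched with a toy model (all cycle counts invented, nothing here is from AMD): if each 512-bit op produces its low 256 bits one cycle before its high 256 bits, and a dependent op only needs the low half to start, then a dependent chain of N ops pays just one trailing cycle for the split, not one cycle per op.

```python
# Toy timing model of a dependent chain of 512-bit ops, assuming
# single-cycle 256-bit ALUs. Hypothetical numbers for illustration.

def chain_latency(n_ops, split_halves):
    """Cycles until the FULL result of the last op in a dependent
    chain of n_ops single-cycle ops is available."""
    if not split_halves:
        return n_ops          # full-width unit: 1 cycle per op
    # Split: op i emits its low half at cycle i+1 and its high half at
    # cycle i+2, but op i+1 can begin as soon as the low half is
    # forwarded, so the chain still advances one op per cycle; only the
    # final op's high half lands one cycle later.
    return n_ops + 1

# The cost of splitting is a single trailing cycle, however long the chain:
print(chain_latency(100, split_halves=False))  # 100
print(chain_latency(100, split_halves=True))   # 101
```

This is the sense in which "some sequential operation inside the vector units is much more subtle": a latency microbenchmark over a long chain sees essentially the full-width number.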

You can get most of the value with pairs that reuse the destination operand. So three inputs, one output, ie rD = rA op1 rB; rD = rD op2 rC
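
A minimal sketch of the port accounting behind that claim (the op names and counts are illustrative, not any real ISA's): the destination-reuse pair needs only three distinct source reads and one write, versus the four reads and two writes of Brett's fully general pair.

```python
# Semantics and port cost of a destination-reuse instruction pair:
#   rD = rA op1 rB
#   rD = rD op2 rC    <- second op reads rD back and overwrites it

def fused_pair(a, b, c, op1, op2):
    """rD = (rA op1 rB) op2 rC, tracked as one fused entry."""
    rd = op1(a, b)    # first op writes rD
    rd = op2(rd, c)   # second op consumes rD, reuses the same dest
    return rd

# Register-file ports if the pair is tracked as a single entry:
FUSED_READS, FUSED_WRITES = 3, 1        # rA, rB, rC in; rD out
# Versus two independent 2-read/1-write ops, or a general 4R/2W pair:
UNFUSED_READS, UNFUSED_WRITES = 4, 2

# e.g. a shift-and-add pair: rD = (rA << rB) + rC
print(fused_pair(3, 1, 5, lambda a, b: a << b, lambda a, b: a + b))  # 11
```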

And Apple appears to do this. At least there is a trail of LLVM evidence to this effect:

Apple's LLVM submissions give their official claims as to which instructions are supported and which performance tweaks matter (most importantly, fusions).

The patterns implemented by these fusions are described in https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64MacroFusion.cpp
and https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64.td states that the A14 has them.
(Look for isArithmeticLogicPair in AArch64MacroFusion.cpp.)
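
A much-simplified model of what such a fusion predicate checks (the real code in AArch64MacroFusion.cpp inspects MachineInstr opcodes and operand wiring; the opcode sets and tuple encoding below are invented for illustration): the first instruction is a simple arithmetic op, the second is a logic op, and the second consumes the first's destination register.

```python
# Hypothetical, simplified arithmetic+logic fusion check.
# Instructions are modeled as (opcode, dest, src1, src2) tuples.

ARITH_OPS = {"add", "sub"}               # illustrative subset only
LOGIC_OPS = {"and", "orr", "eor", "bic"}

def is_arithmetic_logic_pair(first, second):
    """True if `second` is a logic op that reads the result of an
    adjacent arithmetic op `first` -- a candidate fusion pair."""
    f_op, f_dst, *_ = first
    s_op, _, s_src1, s_src2 = second
    return (f_op in ARITH_OPS
            and s_op in LOGIC_OPS
            and f_dst in (s_src1, s_src2))  # dependency between the two

# e.g. "add x0, x1, x2 ; and x0, x0, x3" would be a candidate:
print(is_arithmetic_logic_pair(("add", "x0", "x1", "x2"),
                               ("and", "x0", "x0", "x3")))  # True
```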

It's still unclear to what extent these are implemented in A14/15/16, as opposed to being aspirational (ie getting code ready for future CPUs).
I used to be sure they were not implemented; I'm no longer sure, having looked at the various code that has been used to test for their presence and realized ways in which it may not be testing what the authors thought it was testing.
(For example, an author may be testing the latency of a sequence of these arithmetic+logic fusions and seeing no improvement, BUT there may still be a substantial resource amplification [ie only half the number of destination registers and/or ROB slots are allocated, which you won't see if you don't look for it].)
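
That pitfall can be made concrete with a toy model (all numbers invented): in a dependency chain of N arithmetic+logic pairs, if fusion does not shorten execution latency but does merge each pair into one tracked entry, a latency microbenchmark measures identical times for the fused and unfused machines while the ROB and rename-register footprint of the chain halves.

```python
# Toy cost model for a dependent chain of arithmetic+logic pairs.
# Assumption (hypothetical): each op has 1-cycle latency, and fusion
# does NOT shorten the chain -- both ops still execute in sequence --
# but a fused pair occupies one ROB slot and one destination register.

def chain_cost(n_pairs, fused):
    ops = 2 * n_pairs
    latency = ops                          # unchanged by fusion
    rob_slots = n_pairs if fused else ops  # halved by fusion
    dest_regs = n_pairs if fused else ops  # halved by fusion
    return latency, rob_slots, dest_regs

lat_f, rob_f, regs_f = chain_cost(100, fused=True)
lat_u, rob_u, regs_u = chain_cost(100, fused=False)
print(lat_f == lat_u)       # True: a latency test sees no difference
print(rob_u // rob_f)       # 2: but the resource footprint halved
```

So a test built purely around chain latency would conclude "no fusion" even on a machine where the fusion is real and valuable under ROB or rename pressure.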