Sunny Cove wide

By: Kevin G, December 17, 2018 10:09 am
Room: Moderated Discussions
Travis Downs on December 13, 2018 8:17 am wrote:
> Kevin G on December 13, 2018 7:37 am wrote:
> > Seni on December 13, 2018 2:33 am wrote:
> > > Travis Downs on December 12, 2018 8:25 pm wrote:
> > > > Seni on December 12, 2018 1:58 pm wrote:
> > > > > -2nd store port. As far as x86 is concerned, this is (probably?) the only increase
> > > > > in store ports that has ever occurred. It's unclear how important store port count
> > > > > is to performance, since there is no precedent to base an estimate on.
> > > > >
> > > > >
> > > > > -5-wide renamer. The wording I've seen is extremely vague and possibly misleading, but
> > > > > it sounds like they increased the renamer from 4 work units to 5. A renamer work unit
> > > > > is not the same as either an instruction or a uop, though, so a lot of unresolved questions
> > > > > here. There is no sign of a breakthrough renaming technique. Just wider.
> > > >
> > > > The work unit is "fused uop". I.e., something like an ALU op with a memory-source operand
> > > > counts as 1 for renaming, even though it will execute as two separate (unfused) uops.
> > > >
> > > > At least that has been the case going back to at least SNB.
> > >
> > > Are you certain that there have been no changes in the details
> > > of what can be fused and where the fusion occurs?
> >
> > The last big change to cracking and fusion was with Nehalem, when fusion began to work in
> > 64-bit mode. At least up to that generation, it occurred only between neighboring instructions
> > in very select (but common) instances. Since then Intel has been rather quiet on these
> > details, so the presumption was that they weren't focusing too much on them.
> You are talking about macro-fusion: fusing two+ instructions into a single uop. The
> type of fusion that we were talking about as being relevant for allocation is more about
> micro-fusion: fusing the two uops that come from a single instruction in various front-end
> structures and at rename time but not in the scheduler or at execution.
> When we say "fused domain" or "unfused domain" it is referring to micro-fusion.
> Macro-fusion is very simple in this regard: two instructions are fused into a single op and this op counts
> as 1 forever after that (note that it is possible to have both micro and macro in the same uop).

Yeah, this is confusing as Intel does both and they can be done together.

The details on the micro-op side are scarce, as one would expect, since they are not programmer-facing and could change in future implementations.
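
To make the fused/unfused distinction above concrete, here is a toy counting model in Python. It is only a sketch of the bookkeeping described in this thread, not a description of any real Intel pipeline: the `Instr` class, its fields, and the example instruction mix are all hypothetical, and real fusion rules have many more conditions.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    """Hypothetical instruction record for a toy uop-counting model."""
    name: str
    unfused_uops: int = 1                 # uops that actually execute
    micro_fused: bool = False             # load+op tracked as one uop at rename
    macro_fuses_with_next: bool = False   # e.g. cmp immediately followed by jcc


def count_uops(instrs):
    """Return (fused_domain, unfused_domain) uop counts for a sequence."""
    fused = unfused = 0
    i = 0
    while i < len(instrs):
        ins = instrs[i]
        unfused += ins.unfused_uops
        # A micro-fused instruction occupies one rename slot, whatever it
        # later splits into at the scheduler.
        fused += 1 if ins.micro_fused else ins.unfused_uops
        # A macro-fused pair is a single uop from here on, so the following
        # branch is absorbed and skipped.
        absorbed = ins.macro_fuses_with_next and i + 1 < len(instrs)
        i += 2 if absorbed else 1
    return fused, unfused


def rename_cycles(n_fused, width):
    """Ideal lower bound on rename cycles, ignoring every other bottleneck."""
    return -(-n_fused // width)  # ceiling division


program = [
    Instr("add rax, [rbx]", unfused_uops=2, micro_fused=True),
    Instr("mov rcx, rdx"),
    Instr("cmp rax, rcx", macro_fuses_with_next=True),
    Instr("jne loop"),
]
fused, unfused = count_uops(program)
print(fused, unfused)
```

In this model the four instructions take three rename slots but execute as four uops. Under the same idealized assumption, a hypothetical loop body of 10 fused-domain uops would need at least `rename_cycles(10, 4)` cycles on a 4-wide renamer versus `rename_cycles(10, 5)` on a 5-wide one.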

> > However, there do appear to be a few cases for further improvement. Only one fused op can be used at
> > once in Sandy Bridge and previous designs. Not sure how often two fused ops can appear next to each
> > other but it'd be a minor boost. Intel could expand the window for fusing instructions instead of nearest
> > neighbors. I fathom that this would be rather complex to implement and the power cost may exceed the
> > performance gains it would extract. On Haswell and later, there is the possibility that multiply-add
> > could be fused together.
> Note that FMA and separate mul + add don't give the same results, due to intermediate rounding.
> So you can't just replace the latter with the former without affecting the documented IEEE
> results. However, if such fusion was useful, it would probably be easy to have a flag in
> your FMA unit that did the intermediate rounding so you get the same answer.

Agreed, the same mechanism that fuses instructions can also set the appropriate rounding flags.
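
For anyone who wants to see the intermediate rounding difference, here is a small Python sketch. It assumes IEEE 754 double precision with round-to-nearest-even, and it emulates the fused operation with exact rational arithmetic followed by a single rounding rather than calling a hardware FMA; the function names are made up for illustration.

```python
from fractions import Fraction


def mul_then_add(a, b, c):
    """Separate multiply and add: the product is rounded, then the sum."""
    return a * b + c


def fused_mul_add(a, b, c):
    """Emulated FMA: compute a*b + c exactly, then round once to double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# a * b is exactly 1 - 2**-54, which is not representable as a double and
# rounds to 1.0, so the separate version loses the low bit entirely.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

separate = mul_then_add(a, b, c)
fused = fused_mul_add(a, b, c)
print(separate, fused)
```

With these inputs the separate sequence returns 0.0 while the single-rounding version returns -2**-54, which is why hardware could not silently swap one for the other without the extra rounding step mentioned above.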

> However, I think FP mul + add is one of the least useful cases to macro fuse: since you already have
> the explicit fused instruction exposed in the ISA, and almost everyone who cares about performance will
> already be using it. So fusion would only help the small set of people who didn't care enough to make
> sure they were using the FMA opcode (and perhaps some old binaries compiled prior to FMA support).

I would disagree here, as FMA instructions only arrived relatively recently on Intel (with Haswell, alongside AVX2), building on the three-operand VEX encoding that AVX introduced. With CPUs still on the market today without AVX support, there will be developers who target the lowest common denominator and leave such optimizations on the table in favor of simplicity. This also ignores the decades of legacy software that could see a slight boost in performance from multiply-add fusion.

> The real power of fusion is to create single-uop operations that don't exist in the ISA - then
> you help even well-written code without bloating your ISA with all sorts of compound operations.


That is the thing for x86 though, you already are dealing with a large amount of bloat in programmer accessible space. This is generally the case for micro-ops in the first place as several different instructions can map reasonably into a common micro-op (see SSE and AVX overlaps).

> > For legacy FP code, this could be a nice big win. Integer multiply-add would
> > be a small but seemingly straightforward boost. Similarly, similar scalar FP operations could be merged
> > into a single vector operation for legacy code. This would certainly involve more complexity with the
> > big gains limited to older code. With newer code, there
> > could be incentive to fuse multiple vector operations
> > into a larger vector op (two independent 256 bit vector adds
> > to one 512 bit vector add). Again, complexity/power
> > vs. performance gain it does seem like it would be an immediate win off hand.
> This type of merging scalar or narrow SIMD op into wider SIMD ops (basically hardware auto-vectorization)
> is not very easy. The problem is that the scalar ops have separately renamed destination registers (also
> inputs), while the merged op result will end in a single wide SIMD register. You could imagine a type of
> fusion that still does the renaming in the usual way, but does the ALU op with the wide SIMD EU, and stuffs
> the results back into the various output registers: but this will just bottleneck on the register gather/scatter
> of inputs and outputs and renaming since that is very limited wrt SIMD width on modern chips.

I agree.

I accidentally a word there. My last line should read "Again, complexity/power vs. performance gain, it does not seem like it would be an immediate win off hand."

Thinking more about this, it could be done but that ventures into the rabbit hole of bad ideas. Conceptually it is possible but running the scenario in my head doesn't point to any sort of clear performance victory nor power savings.

Speaking of running scenarios, there is a similar concept: breaking the wider 512-bit vectors down into 128-bit components. This includes virtualizing the program-accessible 512-bit registers onto the smaller hardware implementations. The execution units are similarly broken down into 128-bit-wide units and increased in number to match the expected throughput. The obvious problem is that the core becomes very, very wide in terms of dispatch and execution ports. However, such a core would work exceptionally well with legacy code (SSE/AVX1/AVX2) and play nicely with a high SMT count. As wider instructions are encountered, decreasing the SMT count to balance throughput is an interesting possibility. Again, this is an idea which I don't think would pay off in performance/watt, but it does have the potential to be faster.
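
As a rough illustration of the cracking idea, the Python sketch below splits a wide element-wise add into independent 128-bit pieces. It is purely a software model under my own assumptions (32-bit elements, hypothetical helper names, one "uop" per 128-bit chunk) and says nothing about how real hardware would schedule or rename the pieces.

```python
LANE_BITS = 128
ELEM_BITS = 32
ELEMS_PER_LANE = LANE_BITS // ELEM_BITS  # four 32-bit elements per chunk
MASK = (1 << ELEM_BITS) - 1


def crack(vec):
    """Split a wide vector (list of ints) into 128-bit chunks."""
    return [vec[i:i + ELEMS_PER_LANE]
            for i in range(0, len(vec), ELEMS_PER_LANE)]


def add_lane(x, y):
    """One 128-bit add uop: four 32-bit wraparound adds."""
    return [(a + b) & MASK for a, b in zip(x, y)]


def wide_add(x, y):
    """Run a wide add as independent 128-bit uops; return (result, uop count)."""
    uops = list(zip(crack(x), crack(y)))
    result = [elem for cx, cy in uops for elem in add_lane(cx, cy)]
    return result, len(uops)


# A 512-bit add (16 x 32-bit) becomes four 128-bit uops in this model,
# while a 256-bit add (8 x 32-bit) becomes two.
x512 = list(range(16))
y512 = [MASK] * 16            # adding 0xFFFFFFFF wraps around to "minus one"
result512, uops512 = wide_add(x512, y512)
result256, uops256 = wide_add(list(range(8)), [1] * 8)
print(uops512, uops256)
```

Element-wise operations like this split cleanly because no chunk depends on another; anything that moves data across chunks, such as a full-width shuffle, would not decompose this simply.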