Sunny Cove wide

By: anon (spam.delete.delete.delete@this.this.this.spam.com), December 16, 2018 12:04 pm
Room: Moderated Discussions
Travis Downs (travis.downs.delete@this.gmail.com) on December 16, 2018 9:57 am wrote:
> anon (spam.delete.delete.delete@this.this.this.spam.com) on December 16, 2018 9:37 am wrote:
> > Travis Downs (travis.downs.delete@this.gmail.com) on December 16, 2018 8:19 am wrote:
> > > anon (spam.delete.delete.delete@this.this.this.spam.com) on December 15, 2018 9:55 am wrote:
> > > > Travis Downs (travis.downs.delete@this.gmail.com) on December 15, 2018 9:08 am wrote:
> > > > > anon (spam.delete.delete.delete@this.this.this.spam.com) on December 14, 2018 2:57 am wrote:
> > > > > > Travis Downs (travis.downs.delete@this.gmail.com) on December 13, 2018 4:36 pm wrote:
> > > > > > > anon (spam.delete.delete.delete@this.this.this.spam.com) on December 13, 2018 3:51 pm wrote:
> > > > > > > >
> > > > > > > > We're talking about x86, you can't have multiple flags so no
> > > > > > > > matter how it's handled there's only one possible source.
> > > > > > > >
> > > > > > >
> > > > > > > In the in-order part sure, but executing out of order of course there can be many different flag values
> > > > > > > in flight at once. So flags have to be renamed just like registers or else out-of-order doesn't work.
> > > > > > >
> > > > > > > As far as I know the renamer associates writes of the flags to the same phys register as the destination
> > > > > > > of the flag-writing instruction (and this means that something like test or cmp which doesn't have
> > > > > > > a destination register still gets a renamed physical register, just for the flags). Every flag
> > > > > > > consuming instruction then gets a reference to the last written flag register.
> > > > > > >
> > > > > > > That's at least what Intel describes in their patents: other variants are possible
> > > > > > > but it appears to work reasonably close to this based on experimentation.
> > > > > > >
> > > > > >
> > > > > > The point is that renaming works differently for flags.
> > > > > > There aren't multiple flags to choose from. It's always the most recent one.
> > > > >
> > > > > Well there are multiple flags and instructions write different subsets of them, but sure it's
> > > > > not exactly the same as register renaming because it's a somewhat more restricted problem
> > > > > in that sense (although multiple flags can be written which is not true for registers).
> > > > >
> > > >
> > > > Like I said partial register problems don't apply due to the macro fusion restrictions.
> > > >
> > > > > In any case, let's assume for the sake of argument that flag renaming is much simpler at the hardware
> > > > > level (as I mentioned twice already I think it basically piggy-backs on the destination register
> > > > > renaming). The original claim was that flag renaming doesn't occur and so test/jcc-type fusion was
> > > > > therefore easier. I think we agree that flag renaming does occur, but that perhaps it is "easy",
> > > > > so how does that relate to the original discussion? What does it say about fusing mov/op?
> > > > >
> > > >
> > > > I'm not sure how you managed to interpret "there is only one most recent SR" as "there is no renaming".
> > > > If there was only one SR the qualifier "most recent" would be redundant.
> > >
> > > I was referring to this:
> > >
> > > > Implicit operands like
> > > > flags and with push/pop are dealt with in the frontend iirc so all the information is available.
> > >
> > > I thought it meant "flags are not renamed" since they are handled entirely in the front-end which
> > > is how we started this long digression. If you believe that flags are renamed, but that this
> > > is simpler in hardware than general purpose register renaming, then we probably agree.
> > >
> > > As before, I don't think the discussion of whether flags are renamed is even relevant to macro-fusion.
> > >
> > > > And they are handled as always because the jump doesn't produce a result.
> > > > This is about the result that is passed between the fused instructions.
> > > > You must be able to recognize that there's a difference between passing SR results and GPR results.
> > >
> > > Finally I think I understand the disconnect and why you keep talking about the renamer
> > > and "overwriting" registers and now passing results between the fused instructions.
> > >
> > > As I understand now you see macro-fusion as still leaving the two component ops somehow
> > > intact although "fused together", so that the renamer, for example, still needs to handle
> > > the complexity of a mov followed by an op and that's how we started discussing this.
> > >
> > > That is, that the result of macro-fusion is similar in a way to the result of micro-fusion:
> > > where both uops still exist, but are fused together for some (perhaps all) of the pipeline.
> > >
> >
> > No, I never believed that.
>
> Why are you talking about rename then? A fused mov + op is invisible to the renamer.
> It is no more challenging than any of a variety of other existing instructions.
>

See below, it might not be, depending on how LEA is handled if you wanted to handle it like that.

> >
> > > Seen in that context, this whole discussion makes sense.
> > >
> > > However, I don't think that's how macro-fusion works at all. In my understanding, a single uop pops
> > > out of decoding, and from there on it looks exactly like any other single uop that could have originated
> > > from a single instruction. I.e., there is no trace of the "fusedness" of the instruction (except to
> > > the extent that the produced op obviously doesn't come from an ISA-visible instruction).
> > >
> > > Seen in that light maybe what I'm saying is clearer: the decoder takes the mov-op pair and emits a single
> > > lea-like op for the rest of the pipeline, which doesn't retain any trace of its fused nature. There is
> > > no renaming problem because it looks like any other instruction with 2 inputs and 1 output.
> > >
> > > Maybe we are on the same page now?
> > >
> >
> > Not really.
> >
> > LEA is a special case and you know it.
> > It would be extremely weird if Intel handled LEA completely differently in the decoder and everywhere
> > else so I'd assume it's more likely that LEA uses the normal addressing encoding that you'd see in
> > normal fused domain op for reg, mem. The only differences are that the destination register is not
> > needed as an input and that the SR isn't modified, so less is needed than in a normal fused reg,
> > mem op. The only special handling required after the renamer is passing it to an ALU instead of
> > an AGU. Except for 3 component LEA which can only go to port 1 it looks exactly like a standard
> > ALU op with 2 input registers and one output register which just doesn't modify the SR.
> > A macro fused jump looks exactly like a normal ALU op with 2
> > registers as inputs, one or none of them as output and
> > normal SR modification. Only the opcode has to be different and the ALU needs to know how to handle it.
> >
> > So there are some questions as to how mov fusion would be implemented.
> > Where do you put the third register in the fused unrenamed uop? Does a format with 3 GPRs exist?
> > If not, do you use the fields that are usually used for addressing modes? How much special handling
> > does LEA require in the renamer? Is the second input moved to a different field to make it fit
> > in an unfused uop that goes to ALUs or can it just stay in the field that would usually go into
> > an unfused uop that is sent to the AGUs? What about false dependencies?
> >
> > LEA is important enough to make it look like normal ALU uop after the renamer and before the renaming
> > it's just standard load encoding, but look at all the other instructions that deviate from the norm.
> > Look at those that only need the dest reg as destination, not as a source and count how many of them
> > have a false input dependency on dest. So I don't think it's as straightforward as you think.
>
> Yeah, LEA was probably a terrible example because it is different in a lot of ways.
>

I think now we're on the same page.

> Consider instead any 3-operand instruction then like bzhi, andn, pext, etc.
> These decode to 1 uop and don't seem to pose any problem. They don't execute
> on every port, but I think that's more an ALU limitation than anything.
>

I'd imagine the VEX encoding makes it a bit easier since it's closer to the SIMD formats that do require 3 registers, but it's a valid argument.

> > Where do you put the third register in the fused unrenamed uop? Does a format with 3 GPRs exist?
>
> Obviously yes since 3 argument instructions exist. After rename a 2-op destructive source op looks very much
> like a 3-op non-destructive op, so I don't see much reason anything after rename should care about
> 2-op vs 3-op. For the uop format before rename, there might be cases on earlier uarches where 3-op had some
> additional restriction or penalty due to space constraints but AFAIK anything like that is gone now.
>

Yes, but you see where this is going, right?
On Skylake, if you're lucky and all your assumptions are correct, it might "just work". It's not guaranteed, but if you're optimistic you'd call it the most likely outcome.
On anything before that you would've had trouble.
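
To make the quoted point about the post-rename format concrete, here is a toy sketch. It is my own simplification, not a description of any real Intel renamer: `rename` is a hypothetical function, a plain dictionary stands in for the RAT, and flags are ignored entirely. It only shows that once sources are looked up and the destination gets a fresh physical register, a destructive 2-operand op and a non-destructive 3-operand op end up with the same shape.

```python
import itertools

def rename(uops, arch_regs):
    """Toy renamer: each uop is (op, dst, srcs) using architectural names."""
    # Hypothetical initial mapping: one physical register per architectural one.
    rat = {reg: f"p{i}" for i, reg in enumerate(arch_regs)}
    fresh = itertools.count(len(arch_regs))
    renamed = []
    for op, dst, srcs in uops:
        phys_srcs = tuple(rat[s] for s in srcs)  # read sources before remapping dst
        phys_dst = f"p{next(fresh)}"             # allocate a new physical register
        rat[dst] = phys_dst
        renamed.append((op, phys_dst, phys_srcs))
    return renamed

regs = ["rax", "rbx", "rcx"]
# add rax, rbx: destructive, the destination is also the first source.
two_op = rename([("add", "rax", ("rax", "rbx"))], regs)
# andn rcx, rax, rbx: non-destructive, three distinct architectural registers.
three_op = rename([("andn", "rcx", ("rax", "rbx"))], regs)
print(two_op, three_op)
```

What the sketch deliberately leaves out is the uop format before rename, which is where the number of architectural register fields matters.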

Compare that with jump fusion. That has definitely worked since Core. So while jump fusion is completely transparent to the renamer as long as it's capable of dealing with normal ALU ops, which is a requirement for the CPU to work at all, mov fusion at the very least requires a renamer as sophisticated as Skylake's. If something is "transparent to the renamer" but only works with the renamer of a single architecture, then it's a very opaque transparency. Jump fusion and mov elimination are truly transparent; they only need a renamer to exist.
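
As a rough illustration of the difference in what the unrenamed uop has to carry, here is a toy decoder. Everything in it is an assumption made for the sketch: the `Uop` class, the `decode` and `reg_fields` helpers, the set of fusable ops, and the mov+op fusion rule itself, which is the hypothetical feature under discussion and not something any shipping decoder is known to do. Flags and immediates are ignored.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

ALU_OPS = {"add", "sub", "and", "or", "xor"}  # hypothetical set of fusable ops
CMP_OPS = {"cmp", "test"}

@dataclass(frozen=True)
class Uop:
    op: str
    dst: Optional[str]
    srcs: Tuple[str, ...]

def decode(insns):
    """Toy decoder: fuse cmp/test+jcc pairs and mov+op pairs into single uops."""
    out, i = [], 0
    while i < len(insns):
        cur = insns[i]
        nxt = insns[i + 1] if i + 1 < len(insns) else None
        if cur[0] in CMP_OPS and nxt and nxt[0].startswith("j"):
            # Jump fusion: two register inputs, no register output.
            out.append(Uop(cur[0] + "+" + nxt[0], None, (cur[1], cur[2])))
            i += 2
        elif cur[0] == "mov" and nxt and nxt[0] in ALU_OPS and nxt[1] == cur[1]:
            # Hypothetical mov fusion: mov d, s ; op d, t  ->  op d, s, t
            second = cur[2] if nxt[2] == cur[1] else nxt[2]
            out.append(Uop(nxt[0], cur[1], (cur[2], second)))
            i += 2
        elif cur[0] == "mov":
            out.append(Uop("mov", cur[1], (cur[2],)))
            i += 1
        elif cur[0] in ALU_OPS:
            # Destructive 2-operand form: the destination is also a source.
            out.append(Uop(cur[0], cur[1], (cur[1], cur[2])))
            i += 1
        elif cur[0] in CMP_OPS:
            out.append(Uop(cur[0], None, (cur[1], cur[2])))
            i += 1
        else:
            out.append(Uop(cur[0], None, ()))  # unfused jump, no registers
            i += 1
    return out

def reg_fields(uop):
    """Number of distinct architectural registers the unrenamed uop must name."""
    regs = set(uop.srcs)
    if uop.dst is not None:
        regs.add(uop.dst)
    return len(regs)

uops = decode([("mov", "rcx", "rax"), ("add", "rcx", "rbx"),
               ("cmp", "rcx", "rdx"), ("jne", "L1")])
for u in uops:
    print(u.op, reg_fields(u))
```

In this toy model the fused cmp+jne never needs more than the two register fields a plain ALU op already has, while the fused mov+add needs three distinct ones, which is the part that depends on what the uop format and renamer can actually handle.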

So if we go back to the start of the discussion, it seems like this might actually have been impossible to implement before Haswell and, due to the false input dependency, would not have been any faster (probably slower in too many cases) until Skylake. That means it actually would've been crazy to implement this before Skylake. Considering all the other changes in the frontend, it can definitely be excused that Intel hasn't implemented this yet.

Even if they were pushing for it, I'd only expect it on Ice Lake. A fifth full decoder seems like a better investment than only occasionally being able to decode an extra instruction. Now that the decoders and uop cache can actually deliver more uops than the renamer can handle, reducing the number of uops would make more sense. Maybe Intel will implement it, maybe not.

Mostly I just wanted to point out that this is not easy, it's not transparent in the way that jump fusion is, and it definitely couldn't have been done at just any point in time.