Sunny Cove wide

By: Maynard Handley, December 13, 2018 11:25 am
Room: Moderated Discussions
Travis Downs on December 13, 2018 7:51 am wrote:
> Seni on December 13, 2018 2:33 am wrote:
> > Travis Downs on December 12, 2018 8:25 pm wrote:
> > > Seni on December 12, 2018 1:58 pm wrote:
> > > > -2nd store port. As far as x86 is concerned, this is (probably?) the only increase
> > > > in store ports that has ever occurred. It's unclear how important store port count
> > > > is to performance, since there is no precedent to base an estimate on.
> > > >
> > > >
> > > > -5-wide renamer. The wording I've seen is extremely vague and possibly misleading, but
> > > > it sounds like they increased the renamer from 4 work units to 5. A renamer work unit
> > > > is not the same as either an instruction or a uop, though, so a lot of unresolved questions
> > > > here. There is no sign of a breakthrough renaming technique. Just wider.
> > >
> > > The work unit is the "fused uop". I.e., something like an ALU op with a memory-source operand
> > > counts as 1 for renaming, even though it will execute as two separate (unfused) uops.
> > >
> > > That has been the case going back to at least SNB.
> >
> > Are you certain that there have been no changes in the details
> > of what can be fused and where the fusion occurs?
> About Sunny Cove (let's just say Ice Lake?), no I have no idea. I was talking about existing chips.
> That is, when you said "A renamer work unit is not the same as either an instruction
> or a uop, though" I had thought you were talking about both the present and the
> future - i.e., that this "work unit" concept applied today as well.
> Today, we know what the work unit is: it's the fused uop [1].
> If I had to guess, it will be approximately the same in ICL. It's hard to see it getting worse,
> and it is also hard to see there being significantly more fusion, unless a new type of fusion is
> introduced. At most, I guess we'll get a reduction in the cases where delamination occurs.

> [1] Well, it's slightly more subtle than that, because there are two flavors of micro-fused uops: those
> that stay fused during allocation and those that don't (delamination) - and those details have changed from
> SNB to HSW to SKL (not sure about SKX). There are also register read limitations: the RAT can only handle
> so many registers per cycle - although it is hard to hit that limit w/o an artificial benchmark.
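
To make the fused-domain accounting in the quoted text concrete, here is a minimal Python sketch. Everything in it (`Instr`, `rename_slots`, `rename_cycles`, the `delaminates` flag) is a made-up illustration, not a description of any real core. It assumes the renamer simply consumes a fixed number of slots per cycle, and it ignores the register read limits and the per-generation delamination rules mentioned above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instr:
    text: str                  # assembly text, for readability only
    unfused_uops: int          # uops that actually execute
    micro_fused: bool          # True if decoded as one fused uop
    delaminates: bool = False  # hypothetical flag: fused uop splits before rename


def rename_slots(instr: Instr) -> int:
    """Renamer work units consumed by one instruction in this toy model."""
    if instr.micro_fused and not instr.delaminates:
        return 1               # a fused uop counts once at rename
    return instr.unfused_uops  # otherwise every uop takes its own slot


def rename_cycles(instrs: List[Instr], width: int) -> int:
    """Cycles needed to rename a block, assuming perfect slot packing."""
    slots = sum(rename_slots(i) for i in instrs)
    return -(-slots // width)  # ceiling division


# Five load+ALU instructions: 5 fused uops, 10 unfused uops.
block = [Instr("add rax, [rbx]", unfused_uops=2, micro_fused=True)] * 5
print(rename_cycles(block, width=4), rename_cycles(block, width=5))
```

Under these assumptions the block fits in one rename cycle at width 5 but needs two at width 4, and it needs two even at width 5 if every uop delaminates.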


What is not being aggressively pushed (as far as I know) is the sort of fusion that rewrites an intermediate register, something like
rA = rB op1 rC
rA = rD op2 rA
going to
rA = rD op2 (rB op1 rC)
The point of this exercise is that you only have to perform the rA allocation once, so when it happens you get more throughput through your renamer.
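
As a rough sketch of that idea (my own toy model, not anything a real renamer is known to do), the Python below fuses an adjacent pair when the second op overwrites the first op's destination and also reads it. The intermediate value is then dead, so only one destination allocation is needed. The names `Op`, `fuse_overwritten_intermediates` and `rename_cycles` are hypothetical, and the model assumes one rename slot per (possibly fused) op and an unlimited number of sources per op.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Op:
    name: str              # operation label, e.g. "op1"
    dst: str               # architectural destination register
    srcs: Tuple[str, ...]  # architectural source registers


def fuse_overwritten_intermediates(ops: List[Op]) -> List[Op]:
    """Fuse adjacent pairs (a, b) where b overwrites a.dst and reads it."""
    out: List[Op] = []
    i = 0
    while i < len(ops):
        a = ops[i]
        b = ops[i + 1] if i + 1 < len(ops) else None
        if b is not None and b.dst == a.dst and a.dst in b.srcs:
            # b's read of the intermediate is replaced by a's own sources.
            other_srcs = tuple(s for s in b.srcs if s != a.dst)
            out.append(Op(f"{b.name}({a.name})", b.dst, other_srcs + a.srcs))
            i += 2
        else:
            out.append(a)
            i += 1
    return out


def rename_cycles(ops: List[Op], width: int) -> int:
    """Cycles to rename, assuming one destination allocation per op."""
    return -(-len(ops) // width)  # ceiling division


pair = [Op("op1", "rA", ("rB", "rC")), Op("op2", "rA", ("rD", "rA"))]
print(fuse_overwritten_intermediates(pair))
```

With four such pairs back to back and a 4-wide renamer, this model needs two rename cycles unfused and one fused. A pair whose second op writes a different register is left alone, since the intermediate rA is still live.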

This may not be THAT valuable for x86 if it's still limited in the front end (but the µop cache is supposed to help with that much of the time?), and I think it becomes more valuable as ARM decode width grows crazy high (now 7 on Apple). Even if you do nothing further with the fused op (so it still occupies two execution slots), you've won at what's probably the most significant pain point right now.
