By: Maynard Handley (name99.delete@this.name99.org), July 6, 2015 12:07 pm
Room: Moderated Discussions
Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on July 6, 2015 11:17 am wrote:
> Maynard Handley (name99.delete@this.name99.org) on July 6, 2015 10:25 am wrote:
> > Gabriele Svelto (gabriele.svelto.delete@this.gmail.com) on July 6, 2015 2:11 am wrote:
> > > Maynard Handley (name99.delete@this.name99.org) on July 5, 2015 6:01 pm wrote:
> > > > Interestingly I did not see any OTHER fusion possibilities in the code. In particular the possibility
> > > > IBM selected (fusing instructions to create a large immediate) is not utilized, which might, of course
> > > > reflect not enough time to add this to the design; but maybe also reflects something about ARM's constant
> > > > generation and so a less frequent need for generating immediates through successive instructions.
> > >
> > > In POWER8's case instruction fusion applies to add immediate / add immediate
> > > shifted + loads which covers a weakness in the ISA which might not apply to
> > > ARM (lack of base + immediate addressing for certain types of load).
> > >
> > > IBM has also been doing more aggressive fusion such as turning conditional-jump-over-a-single-instruction
> > > sequences into a single predicated µop (the instruction can be an add, and, or, xor plus some
> > > immediate forms as well as a store). See the POWER8 user manual section 10.1.4.7.
> > >
> > > Besides being different ISAs, IBM doesn't have many constraints as far as power
> > > goes so the tradeoffs they're making might not apply to Apple's design.
> >
> > The IBM branch over one instruction is neat, but, like you said
> > for forming immediates, it reflects a hole in the ISA.
> > I am guessing that Apple (and the whole ARM camp)'s answer to that is to use a conditional
> > select. And of course Apple, in particular, have very little of a legacy code problem...
> >
> > That does raise the issue that perhaps the next lowest-hanging fruit for ARM op-fusion
> > might be op+predicated move as a single unit? That and three input add seem like
> > the most common patterns left after compare and branch have been handled.
>
> Various ARM cores fuse cmp+bcc, mov+movk, adrp+ldr, adrp+add already. There aren't that many cases
> left that make sense - most fusions are already in the architecture (ldp, writeback, shift+alu etc),
> and you also need to form instructions that are simple enough not to need splitting later.
Do we know for a FACT that all the options you are suggesting are in use (as opposed to "the sensible ones to use")?
That was the point of my initial post: we have some sort of confirmation about what Apple is doing in Cyclone. As far as I know, beyond that, it's all speculation. For example, the A72 is supposed to have enhanced fusion, but I've seen no details on what is being fused, or on how that improves on what the A57 fuses.
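To make concrete what "fusing" one of these pairs would involve, here is a toy sketch in Python. It scans an instruction list for adjacent pairs from the list quoted above (cmp+bcc, mov+movk, adrp+ldr, adrp+add) where the second instruction consumes the first one's result. Everything in it is an assumption for illustration: the `Insn` type, the `find_fusion_candidates` name, and the "adjacent and dependent" rule are mine, and it says nothing about which pairs any real core actually fuses.

```python
from typing import List, NamedTuple, Tuple


class Insn(NamedTuple):
    """Hypothetical, simplified instruction record (not a real decoder format)."""
    op: str                 # mnemonic, e.g. "adrp"
    dst: str                # destination register; "flags" for cmp, "" if none
    srcs: Tuple[str, ...]   # registers (or "flags") read by the instruction


# Pairs named in the quoted post. Whether a given core fuses them is
# NOT confirmed here; this set only drives the illustration.
CANDIDATE_PAIRS = {
    ("cmp", "b.cc"),
    ("mov", "movk"),
    ("adrp", "ldr"),
    ("adrp", "add"),
}


def find_fusion_candidates(code: List[Insn]) -> List[Tuple[int, str, str]]:
    """Return (index, first_op, second_op) for each adjacent candidate pair.

    Assumed rule: the two instructions are adjacent, their mnemonics form a
    listed pair, and the second reads what the first writes.
    """
    found = []
    i = 0
    while i + 1 < len(code):
        first, second = code[i], code[i + 1]
        if (first.op, second.op) in CANDIDATE_PAIRS and first.dst in second.srcs:
            found.append((i, first.op, second.op))
            i += 2  # a fused pair consumes both instructions
        else:
            i += 1
    return found


example = [
    Insn("adrp", "x0", ()),
    Insn("ldr", "x1", ("x0",)),
    Insn("mov", "x2", ()),
    Insn("movk", "x2", ("x2",)),
    Insn("add", "x3", ("x1", "x2")),
    Insn("cmp", "flags", ("x3",)),
    Insn("b.cc", "", ("flags",)),
]

print(find_fusion_candidates(example))
# prints [(0, 'adrp', 'ldr'), (2, 'mov', 'movk'), (5, 'cmp', 'b.cc')]
```

Running something like this over real disassembly would only show how often such pairs occur statically. It would not answer the question above, which is what a particular core does with them.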