By: Wilco (Wilco.Dijkstra.delete@this.ntlworld.com), July 6, 2015 11:17 am
Room: Moderated Discussions
Maynard Handley (name99.delete@this.name99.org) on July 6, 2015 10:25 am wrote:
> Gabriele Svelto (gabriele.svelto.delete@this.gmail.com) on July 6, 2015 2:11 am wrote:
> > Maynard Handley (name99.delete@this.name99.org) on July 5, 2015 6:01 pm wrote:
> > > Interestingly I did not see any OTHER fusion possibilities in the code. In particular the possibility
> > > IBM selected (fusing instructions to create a large immediate) is not utilized, which might, of course
> > > reflect not enough time to add this to the design; but maybe also reflects something about ARM's constant
> > > generation and so a less frequent need for generating immediates through successive instructions.
> >
> > In POWER8's case instruction fusion applies to add immediate / add immediate
> > shifted + loads which covers a weakness in the ISA which might not apply to
> > ARM (lack of base + immediate addressing for certain types of load).
> >
> > IBM has also been doing more aggressive fusion such as turning conditional-jump-over-a-single-instruction
> > sequences into a single predicated µop (the instruction can be an add, and, or, xor plus some
> > immediate forms as well as a store). See the POWER8 user manual section 10.1.4.7.
> >
> > Besides these being different ISAs, IBM doesn't have many constraints as far as power
> > goes so the tradeoffs they're making might not apply to Apple's design.
>
> The IBM branch over one instruction is neat, but, like you said
> for forming immediates, it reflects a hole in the ISA.
> I am guessing that Apple (and the whole ARM camp)'s answer to that is to use a conditional
> select. And of course Apple, in particular, have very little of a legacy code problem...
>
> That does raise the question of whether the next lowest-hanging fruit for ARM op-fusion
> might be op+predicated move as a single unit? That and three-input add seem like
> the most common patterns left after compare and branch have been handled.
Various ARM cores already fuse cmp+bcc, mov+movk, adrp+ldr and adrp+add. There aren't that many cases left that make sense: most fusions are already in the architecture (ldp, writeback, shift+ALU, etc.), and the fused instructions also need to be simple enough not to need splitting later.
Wilco
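
For readers who want something concrete, the pairs listed above can be sketched as a toy pairing pass over a decoded instruction stream. This is only an illustration, not how any particular core does it: the `Insn` type, `can_fuse`, `fuse_pairs` and the register conditions are all assumptions made for the sketch. In particular, requiring the second instruction to overwrite the adrp result is one simple way to keep the fused op down to a single register write, which is the "simple enough not to need splitting" constraint; real cores may use different or looser rules.

```python
from typing import NamedTuple, Optional, Tuple


class Insn(NamedTuple):
    """Hypothetical decoded instruction: mnemonic plus register names."""
    op: str                      # mnemonic, e.g. "adrp" or "b.eq"
    dst: Optional[str] = None    # register written, None if none
    srcs: Tuple[str, ...] = ()   # registers read


def can_fuse(a: Insn, b: Insn) -> bool:
    """Return True if the adjacent pair (a, b) matches one of the assumed rules."""
    # cmp + conditional branch: the branch consumes the flags set by cmp.
    if a.op == "cmp" and b.op.startswith("b."):
        return True
    # mov + movk: both halves build the same immediate in the same register.
    if a.op == "mov" and b.op == "movk":
        return a.dst == b.dst
    # adrp + ldr / adrp + add: the second op reads the adrp result and
    # (assumption) overwrites it, so the fused op writes a single register.
    if a.op == "adrp" and b.op in ("ldr", "add"):
        return a.dst in b.srcs and b.dst == a.dst
    return False


def fuse_pairs(stream):
    """Greedily group adjacent fusible pairs; everything else stays single."""
    groups = []
    i = 0
    while i < len(stream):
        if i + 1 < len(stream) and can_fuse(stream[i], stream[i + 1]):
            groups.append((stream[i], stream[i + 1]))
            i += 2
        else:
            groups.append((stream[i],))
            i += 1
    return groups


example = [
    Insn("adrp", "x0"),
    Insn("ldr", "x0", ("x0",)),       # fuses with the adrp above
    Insn("mov", "x2"),
    Insn("movk", "x2", ("x2",)),      # fuses with the mov above
    Insn("cmp", None, ("x0", "x2")),
    Insn("b.eq"),                     # fuses with the cmp above
    Insn("add", "x3", ("x0", "x2")),  # no partner, stays on its own
]

for group in fuse_pairs(example):
    print("+".join(insn.op for insn in group))
```

Seven instructions go in and four groups come out, so the sketch also shows why fusion mainly helps dispatch and rename bandwidth rather than changing what the code computes.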