By: Maynard Handley (name99.delete@this.name99.org), July 6, 2015 12:55 pm
Room: Moderated Discussions
Maynard Handley (name99.delete@this.name99.org) on July 6, 2015 12:07 pm wrote:
> Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on July 6, 2015 11:17 am wrote:
> > Maynard Handley (name99.delete@this.name99.org) on July 6, 2015 10:25 am wrote:
> > > Gabriele Svelto (gabriele.svelto.delete@this.gmail.com) on July 6, 2015 2:11 am wrote:
> > > > Maynard Handley (name99.delete@this.name99.org) on July 5, 2015 6:01 pm wrote:
> > > > > Interestingly I did not see any OTHER fusion possibilities in the code. In particular the possibility
> > > > > IBM selected (fusing instructions to create a large immediate) is not utilized, which might, of course
> > > > > reflect not enough time to add this to the design; but maybe also reflects something about ARM's constant
> > > > > generation and so a less frequent need for generating immediates through successive instructions.
> > > >
> > > > In POWER8's case instruction fusion applies to add immediate / add immediate
> > > > shifted + loads which covers a weakness in the ISA which might not apply to
> > > > ARM (lack of base + immediate addressing for certain types of load).
> > > >
> > > > IBM has also been doing more aggressive fusion such as turning conditional-jump-over-a-single-instruction
> > > > sequences into a single predicated µop (the instruction can be an add, and, or, xor plus some
> > > > immediate forms as well as a store). See the POWER8 user manual section 10.1.4.7.
> > > >
> > > > Beside being different ISAs, IBM doesn't have many constraints as far as power
> > > > goes so the tradeoffs they're making might not apply to Apple's design.
> > >
> > > The IBM branch over one instruction is neat, but, like you said
> > > for forming immediates, it reflects a hole in the ISA.
> > > I am guessing that Apple (and the whole ARM camp)'s answer to that is to use a conditional
> > > select. And of course Apple, in particular, have very little of a legacy code problem...
> > >
> > > That does raise the issue that perhaps the next lowest-hanging fruit for ARM op-fusion
> > > might be op+predicated move as a single unit? That and three input add seem like
> > > the most common patterns left after compare and branch have been handled.
> >
> > Various ARM cores fuse cmp+bcc, mov+movk, adrp+ldr, adrp+add already. There aren't that many cases
> > left that make sense - most fusions are already in the architecture (ldp, writeback, shift+alu etc),
> > and you also need to form instructions that are simple enough not to need splitting later.
>
> Do we know for a FACT that all the options you are suggesting
> are in use (as opposed to "the sensible ones to use")?
> That was the point of my initial post --- that we have some sort of confirmation about
> what Apple is doing in Cyclone. As far as I know, beyond that, it's all speculation. For
> example A72 is supposed to have enhanced fusion, but I've seen nothing beyond that detail
> as to what is being fused (and how it's enhanced beyond what gets fused in A57).
OK, going through the A57 optimization manual
http://infocenter.arm.com/help/topic/com.arm.doc.uan0015a/cortex_a57_software_optimisation_guide_external.pdf
I see no EXPLICIT reference to op fusion, but we are told that
mov+movk (and similar pairs) and
adrp+add
are "optimized" sequences, which I assume means they are fused. (adrp+ldr is not mentioned)
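For anyone who hasn't looked at what the mov+movk pair actually does, here is a rough Python sketch of how a 32-bit immediate gets split across the two instructions. The function name `mov_movk_pair` is made up for illustration, and this is deliberately simplified: a real compiler would also consider movn, bitmask immediates and a single shifted movz, none of which are modelled here.

```python
def mov_movk_pair(reg: str, imm: int) -> list[str]:
    """Return AArch64 assembly text that loads an unsigned 32-bit immediate.

    Hypothetical helper for illustration only: the low 16 bits go in with
    mov (movz), and the high 16 bits are patched in with movk if needed.
    """
    if not 0 <= imm <= 0xFFFFFFFF:
        raise ValueError("expected an unsigned 32-bit immediate")

    low = imm & 0xFFFF           # bits 15:0, written by mov (zeroes the rest)
    high = (imm >> 16) & 0xFFFF  # bits 31:16, inserted by movk (keeps the rest)

    lines = [f"mov {reg}, #0x{low:x}"]
    if high:
        # movk reads and writes the same register, so the second instruction
        # depends on the first; that dependency is what makes the pair an
        # obvious candidate for special handling in the core.
        lines.append(f"movk {reg}, #0x{high:x}, lsl #16")
    return lines


if __name__ == "__main__":
    for line in mov_movk_pair("w0", 0x12345678):
        print(line)
```

The point is just that the second instruction is useless without the first, which is why keeping the two adjacent costs the compiler essentially nothing.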
Seems that A57 is not using the compare+branch fusion, but maybe that's coming in A72?
I can't tell from the LLVM sources whether Cyclone implements these or not.
The sources seem to indicate (though they don't draw attention to the fact) that when one wants to translate the appropriate abstract op (e.g. load a 32-bit immediate, or perform a certain type of large-offset load) on ANY ARMv8 core, one generates these optimal sequences (which is kind of obvious; what else would you do?).
I don't know enough about LLVM internals to know whether a flag is being set to glue these pairs together so that they are never subsequently split or rearranged.
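To make the "never subsequently split/rearranged" concern concrete, here is a toy Python check that scans an already-scheduled instruction list and reports which candidate pairs are still back to back. Everything in it (`Insn`, `CANDIDATE_PAIRS`, `adjacent_pairs`, the same-destination rule) is my own assumption for illustration; it is not how LLVM does it, and the real matching conditions in any given core are not documented in what I've read.

```python
from typing import NamedTuple


class Insn(NamedTuple):
    """Minimal hypothetical instruction record: opcode, destination, sources."""
    opcode: str
    dest: str
    srcs: tuple[str, ...] = ()


# The pairs the A57 guide describes as "optimized" sequences. Treating them
# as fusion candidates is an assumption, not something the guide states.
CANDIDATE_PAIRS = {("mov", "movk"), ("adrp", "add")}


def adjacent_pairs(stream: list[Insn]) -> list[int]:
    """Return the indices i where stream[i], stream[i+1] form a candidate pair."""
    hits = []
    for i in range(len(stream) - 1):
        first, second = stream[i], stream[i + 1]
        if (first.opcode, second.opcode) not in CANDIDATE_PAIRS:
            continue
        # Assumed rule: both instructions must write the same register.
        if second.dest != first.dest:
            continue
        # Assumed rule: the add must consume the adrp result.
        if second.opcode == "add" and first.dest not in second.srcs:
            continue
        hits.append(i)
    return hits


if __name__ == "__main__":
    scheduled = [
        Insn("adrp", "x0"),
        Insn("add", "x0", ("x0",)),
        Insn("ldr", "x1", ("x0",)),
        Insn("mov", "w2"),
        Insn("ldr", "x3", ("x1",)),  # scheduler moved this between mov and movk
        Insn("movk", "w2"),
    ]
    print(adjacent_pairs(scheduled))
```

In this made-up schedule the adrp+add pair survives, but the mov+movk pair has been pulled apart by an unrelated load, which is exactly the situation a "glue" flag in the compiler would be there to prevent.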