By: noko (noko.delete@this.noko.com), October 2, 2015 5:19 pm
Room: Moderated Discussions
Maynard Handley (name99.delete@this.name99.org) on October 2, 2015 10:37 am wrote:
> Megol (golem960.delete@this.gmail.com) on October 2, 2015 4:39 am wrote:
> > Maynard Handley (name99.delete@this.name99.org) on October 2, 2015 1:10 am wrote:
> > > Exophase (exophase.delete@this.gmail.com) on October 1, 2015 11:07 pm wrote:
> > > > David Kanter (dkanter.delete@this.realworldtech.com) on October 1, 2015 4:49 pm wrote:
> > > > > For example, what happens if the pair loads target different pages?
> > > > > You'd need to do two separate translations through the TLB.
> > > >
> > > > You make this relatively unusual case take 2-3 cycles.
> > > >
> > > > Having a similar penalty for cacheline-crossing loads is still not a huge detriment. Even if you limited
> > > > single-cycle performance to naturally aligned boundaries you would still get significantly better bang
> > > > for your buck vs not having the instruction at all. And since you need decent 64/128-bit load support
> > > > for SIMD it's kind of a given, the only catch is supporting the two register destinations.
> > >
> > > "the only catch is supporting the two register destinations."
> > > And that may be necessary anyway depending on how you support the "S-suffix" instructions (those that
> > > also set the zero/overflow/etc flags). You can crack those as PPC did, but if you've designed them
> > > properly (as I assume ARM did for v8, learning from PPC's mistakes) the natural high performance thing
> > > would be to have a pool of renamed 4-bit flag registers, use the normal rename channels, and just accept
> > > that some largish fraction (20% or so?) of your instructions are going to be two destination. (Once
> > > you have this machinery, you may also be able to use it to fuse instruction pairs that are common but
> > > each generate a separate output if there are cases where that's worth the hassle.)
> >
> > Why would your scheme be more "proper" than the PPC one?
> > Increasing register targets per instruction is expensive, not just with register write ports
> > (not that a power optimized design is likely to support the theoretical register write peak,
> > register port reduction techniques are common) but also in the bypass network etc.
>
> More "proper" = "higher performance".
> Another set of instructions that generate two destinations is
> the various load-store-with-address-update instructions.
Based on the Cortex-A57 optimization guide, writeback forms of loads/stores do indeed generate an additional uop. LDP, however, appears to be only one load uop, except for the vector Q-form, which accesses 256 bits and is 2 uops. Additionally, the ARMv7 vector permute instructions that have 2 register outputs appear to be 1 uop for the D-form and 3 uops for the Q-form.
So it appears that Cortex-A57 and A72 uops support multiple register destinations.
> > I don't remember if AARCH64 supports split conditions (parts of the flag result come from
> > several instructions) but if they learned their lesson and don't, one of the most inexpensive
> > ways to handle condition flags is attaching them to registers. Then the complications evaporate
> > with very little extra state (n extra bits per physical register for flags, keeping track
> > of the register storing the current condition in the renamer) and little overhead.
arm64 does indeed have no instructions that cause partial flag updates. A32/T32 have instructions that only update a subset of the flags.
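To make the "flags attached to registers" idea quoted above concrete, here is a rough toy model of it. This is only a sketch under my own assumptions, not a description of how any ARM core actually works: the names Renamer, flag_src and subs_flags are made up for illustration, the NZCV bit packing is arbitrary, and physical register reclamation is not modeled at all. It relies on every flag-setting instruction writing all four flags, so a single pointer in the renamer is enough to find the current condition.

```python
MASK = (1 << 64) - 1  # model 64-bit registers


def subs_flags(a, b):
    """Compute a - b and NZCV for unsigned 64-bit inputs a and b."""
    res = (a - b) & MASK
    n = res >> 63
    z = int(res == 0)
    c = int(a >= b)  # C is set when the subtraction does not borrow
    v = int((a >> 63) != (b >> 63) and (res >> 63) != (a >> 63))
    # Hypothetical packing: bit 3 = N, bit 2 = Z, bit 1 = C, bit 0 = V
    return res, (n << 3) | (z << 2) | (c << 1) | v


class Renamer:
    """Toy renamer: each physical register carries a value plus 4 flag bits."""

    def __init__(self, num_phys=16):
        self.values = [0] * num_phys
        self.nzcv = [0] * num_phys
        self.free = list(range(num_phys))
        self.arch_map = {}   # architectural name -> physical index
        self.flag_src = -1   # physical register holding the current flags

    def read(self, arch):
        return self.values[self.arch_map[arch]]

    def write(self, arch, value, nzcv=None):
        # Allocate a fresh physical register (old ones are never freed here).
        p = self.free.pop(0)
        self.values[p] = value & MASK
        self.arch_map[arch] = p
        if nzcv is not None:
            # Flag-setting form: flags ride along with the result, and the
            # renamer only has to remember which register holds them.
            self.nzcv[p] = nzcv
            self.flag_src = p
        return p

    def current_flags(self):
        return self.nzcv[self.flag_src] if self.flag_src >= 0 else 0


r = Renamer()
r.write("x0", 5)
r.write("x1", 5)
res, flags = subs_flags(r.read("x0"), r.read("x1"))
p = r.write("x2", res, flags)  # like SUBS x2, x0, x1
r.write("x3", 7)               # non-flag-setting write leaves flag_src alone
```

With partial flag updates (as in A32/T32) a single flag_src pointer would not be enough, since the current flags could be spread over several producers.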