By: Megol (golem960.delete@this.gmail.com), October 2, 2015 3:39 am
Room: Moderated Discussions
Maynard Handley (name99.delete@this.name99.org) on October 2, 2015 1:10 am wrote:
> Exophase (exophase.delete@this.gmail.com) on October 1, 2015 11:07 pm wrote:
> > David Kanter (dkanter.delete@this.realworldtech.com) on October 1, 2015 4:49 pm wrote:
> > > For example, what happens if the pair loads target different pages?
> > > You'd need to do two separate translations through the TLB.
> >
> > You make this relatively unusual case take 2-3 cycles.
> >
> > Having a similar penalty for cacheline-crossing loads is still not a huge detriment. Even if you limited
> > single-cycle performance to naturally aligned boundaries you would still get significantly better bang
> > for your buck vs not having the instruction at all. And since you need decent 64/128-bit load support
> > for SIMD it's kind of a given, the only catch is supporting the two register destinations.
>
> "the only catch is supporting the two register destinations."
> And that may be necessary anyway depending on how you support the "S-suffix" instructions (those that
> also set the zero/overflow/etc flags). You can crack those as PPC did, but if you've designed them
> properly (as I assume ARM did for v8, learning from PPC's mistakes) the natural high performance thing
> would be to have a pool of renamed 4-bit flag registers, use the normal rename channels, and just accept
> that some largish fraction (20% or so?) of your instructions are going to be two destination. (Once
> you have this machinery, you may also be able to use it to fuse instruction pairs that are common but
> each generate a separate output if there are cases where that's worth the hassle.)
Why would your scheme be more "proper" than the PPC one?
Increasing the number of register targets per instruction is expensive, and not just in register write ports (a power-optimized design is unlikely to support the theoretical peak register write rate anyway, since register port reduction techniques are common) but also in the bypass network, etc.
I don't remember whether AArch64 supports split conditions (where parts of the flag result come from several instructions), but if they learned their lesson and don't, one of the least expensive ways to handle condition flags is to attach them to registers. Then the complications evaporate with very little extra state (n extra bits per physical register for flags, plus the renamer keeping track of which register stores the current condition) and little overhead.
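To make the flags-attached-to-registers idea concrete, here is a rough toy model of the rename step. It is only a sketch under my own assumptions: the `Renamer` class, its fields (`map`, `free`, `flag_src`) and the register counts are all made up for illustration and do not describe any real core.

```python
class Renamer:
    """Toy renamer: each physical register is assumed to carry its own flag bits."""

    def __init__(self, num_arch=32, num_phys=64):
        # Architectural register i starts out mapped to physical register i.
        self.map = list(range(num_arch))
        # Remaining physical registers are free (hypothetical sizes).
        self.free = list(range(num_arch, num_phys))
        # Physical register whose attached flag bits are the current flags.
        self.flag_src = None

    def rename(self, dst, srcs, sets_flags=False, reads_flags=False):
        phys_srcs = [self.map[s] for s in srcs]
        # A flag consumer simply gets one more source tag.
        flag_tag = self.flag_src if reads_flags else None
        phys_dst = self.free.pop(0)
        self.map[dst] = phys_dst
        if sets_flags:
            # The flags travel with the result, so the instruction still
            # allocates only one destination tag.
            self.flag_src = phys_dst
        # Not modeled: freeing registers. A real design could not recycle
        # the register named by flag_src until a later flag writer retires.
        return phys_dst, phys_srcs, flag_tag


r = Renamer()
r.rename(dst=1, srcs=[2, 3], sets_flags=True)   # like ADDS x1, x2, x3
r.rename(dst=4, srcs=[1], reads_flags=True)     # a flag consumer
```

Note that a flag-setting instruction with no real destination (a compare) would still need a physical register allocated in this model, just to have somewhere to hang the flag bits.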
But for load pair instructions? The reasonable way to handle them is cracking at decode.
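As a minimal illustration of cracking at decode, the sketch below splits a load pair into two ordinary single-destination loads. The `crack_ldp` function and the tuple layout of the micro-ops are hypothetical, and it assumes the simple base-plus-immediate form with 8-byte elements and no writeback.

```python
def crack_ldp(rt1, rt2, base, offset, size=8):
    """Split 'LDP rt1, rt2, [base, #offset]' into two single-destination loads.

    Micro-op format (an assumption for this sketch):
    (opcode, destination register, base register, byte offset).
    """
    return [
        ("load", rt1, base, offset),          # first element at base + offset
        ("load", rt2, base, offset + size),   # second element right after it
    ]


# LDP x0, x1, [x2, #16] becomes two loads from x2+16 and x2+24.
uops = crack_ldp(0, 1, 2, 16)
```

Each micro-op then has one destination and goes through rename, the load pipes and the TLB like any other load, so the page-crossing case needs no special handling.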