By: Maynard Handley (name99.delete@this.name99.org), October 2, 2015 9:45 am
Room: Moderated Discussions
Exophase (exophase.delete@this.gmail.com) on October 2, 2015 7:43 am wrote:
> Maynard Handley (name99.delete@this.name99.org) on October 2, 2015 1:10 am wrote:
> > "the only catch is supporting the two register destinations."
> > And that may be necessary anyway depending on how you support the "S-suffix" instructions (those that
> > also set the zero/overflow/etc flags). You can crack those as PPC did, but if you've designed them
> > properly (as I assume ARM did for v8, learning from PPC's mistakes) the natural high performance thing
> > would be to have a pool of renamed 4-bit flag registers, use the normal rename channels, and just accept
> > that some largish fraction (20% or so?) of your instructions are going to be two destination. (Once
> > you have this machinery, you may also be able to use it to fuse instruction pairs that are common but
> > each generate a separate output if there are cases where that's worth the hassle.)
> >
>
> Why use the normal rename channels for a separate register file where there's little
> cross-access with the main register file? It's the same thing for FP/SIMD, you're generally
> better off renaming to a different register file with separate write ports.
>
> Flag setting instructions on ARMv8 AArch64 are limited to ADCS, SBCS, ADDS, SUBS, ANDS, BICS, CCMP, and CCMN.
> Definitely nothing like 20% or so. If AArch32 is supported that opens up a much larger pool. But either way,
> CMP/TST like instructions are common enough that you really don't want to have to crack them in two.
My 20% referred to the full set of instructions that generate two results: not just the S-suffix instructions but also the load pairs and the loads with address update (pre- or post-indexed writeback).
In principle this can be measured (on any ARMv8 CPU). Start by seeing whether you can execute one or two load pairs per cycle in a tight loop, then stuff in extra unrelated arithmetic to see whether you reach the width you expect or get blocked by a shortage of destination register write ports. Do the same with load-update instructions and S-suffix instructions.
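As a rough sketch of that experiment, here is a small Python helper that only generates the AArch64 loop bodies to be timed; it does not run or time anything. The register choices (x0-x7 as load destinations, x8-x15 for the filler adds, x19 as loop counter, x20 as base pointer) and the function name are my own assumptions, and you would still need to wrap the output in a harness that sets up x19/x20 and reads a cycle counter.

```python
def make_kernel(n_ldp: int, n_add: int, label: str = "1") -> str:
    """Return AArch64 assembly text for a loop with n_ldp load pairs
    and n_add unrelated adds per iteration (hypothetical test kernel)."""
    if not 0 <= n_ldp <= 4:
        raise ValueError("n_ldp must be 0..4 (destinations x0-x7)")
    if not 0 <= n_add <= 8:
        raise ValueError("n_add must be 0..8 (registers x8-x15)")
    lines = [f"{label}:"]
    # Each load pair writes two destination registers from base x20.
    for i in range(n_ldp):
        lines.append(f"    ldp x{2 * i}, x{2 * i + 1}, [x20, #{16 * i}]")
    # Filler arithmetic: one single-destination add per register,
    # each chain independent of the others and of the loads.
    for j in range(n_add):
        lines.append(f"    add x{8 + j}, x{8 + j}, #1")
    # Loop overhead avoids flag-setting instructions on purpose,
    # so it adds no extra two-destination work of its own.
    lines.append("    sub x19, x19, #1")
    lines.append(f"    cbnz x19, {label}b")
    return "\n".join(lines)


if __name__ == "__main__":
    # Sweep the amount of filler arithmetic next to two load pairs.
    for n_add in range(0, 5):
        print(make_kernel(2, n_add))
        print()
```

Timing each variant and watching where cycles per iteration stop being flat is what would hint at a write-port limit. Swapping the `ldp` line for a writeback load or an S-suffix instruction gives the other two cases.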
The one data point I could find is that the A72 executes on average 1.08 micro-ops per instruction, which seems low if the entire set I've suggested above is cracked; but it may be plausible if, e.g., load-update (and some other instructions we haven't mentioned) are cracked but S-suffix and load/store pair are not.
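To make the arithmetic behind that concrete, here is a back-of-envelope sketch in Python. The instruction mix is entirely made up for illustration (only the 20% total for two-result instructions comes from my guess above, and the per-class split is invented); the model also assumes a cracked instruction becomes exactly two micro-ops and everything else is one.

```python
def avg_uops(mix: dict, cracked: set) -> float:
    """Average micro-ops per instruction, assuming each cracked class
    splits into exactly 2 micro-ops and every other class is 1."""
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("instruction mix fractions must sum to 1")
    return sum(frac * (2 if cls in cracked else 1)
               for cls, frac in mix.items())


# Hypothetical dynamic instruction mix, NOT measured data.
mix = {
    "single_dest": 0.80,   # ordinary one-result instructions
    "s_suffix":    0.08,   # flag-setting ALU ops
    "load_pair":   0.06,
    "load_update": 0.06,   # loads with address writeback
}

if __name__ == "__main__":
    everything = {"s_suffix", "load_pair", "load_update"}
    print("all two-result classes cracked:", avg_uops(mix, everything))
    print("only load-update cracked:      ", avg_uops(mix, {"load_update"}))
```

With this invented mix, cracking everything lands around 1.2 micro-ops per instruction, while cracking only load-update lands around 1.06, which is the sort of gap that makes 1.08 look like partial cracking.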