By: Exophase (exophase.delete@this.gmail.com), October 2, 2015 10:23 am
Room: Moderated Discussions
Maynard Handley (name99.delete@this.name99.org) on October 2, 2015 10:45 am wrote:
> My 20% referred to the full set of instructions that generate two results.
> This is not just the S-suffix instructions but also the
> load-pairs and the load+address+update instructions.
>
> In principle this can be learned (for all the ARM-v8 CPUs). Start by seeing whether you can
> execute one vs 2 load pairs/cycle in a tight loop, then stuff in extra unrelated arithmetic
> to see whether you can reach the width you expect, or get blocked by not enough destination
> register write ports. Same thing with load-update instructions and S-suffix instructions.
> The one thing I could find on this is A72 on average executes 1.08 micro-ops per instruction, which seems
> low if the entire set I've suggested above are all cracked; but maybe it's plausible if, eg, load-update
> (and some other instructions we haven't mentioned) are cracked but S-suffix and load/store pair are not.
Again, you shouldn't conflate flag-generating instructions with instructions that modify two general-purpose registers, because typical uarchs won't implement them the same way. Nor is a flags source considered an extra input, i.e. CSEL doesn't have the same implementation implications as MADD.
But even if you did put all instructions that modify flags in the same category as instructions that modify two registers, I doubt you'd come anywhere close to 20%, although it depends on exactly how you count instructions.
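As a rough sanity check on the numbers being thrown around, here is a back-of-the-envelope sketch in Python. It assumes a deliberately simple model that is not from any ARM documentation: every cracked instruction becomes exactly two micro-ops and everything else is one micro-op. The function names are made up for illustration.

```python
# Hypothetical helper names; the "2 micro-ops per cracked instruction"
# figure is an assumption of this sketch, not a documented A72 property.

def cracked_fraction(avg_uops_per_instr, uops_per_cracked=2):
    """Fraction of instructions that must be cracked to reach the given
    average, assuming all other instructions map to one micro-op."""
    if uops_per_cracked <= 1:
        raise ValueError("a cracked instruction must yield more than one micro-op")
    return (avg_uops_per_instr - 1.0) / (uops_per_cracked - 1)

def avg_uops(fraction_cracked, uops_per_cracked=2):
    """Average micro-ops per instruction for a given cracked fraction."""
    return 1.0 + fraction_cracked * (uops_per_cracked - 1)

# The quoted 1.08 micro-ops/instruction figure under this model:
print(round(cracked_fraction(1.08), 2))  # 0.08

# What a 20% cracked mix would give instead:
print(round(avg_uops(0.20), 2))  # 1.2
```

Under that model, 1.08 micro-ops per instruction corresponds to about 8% of instructions being cracked, while a 20% cracked mix would show up as about 1.2. If some instructions crack into more than two micro-ops, the implied fraction is lower still.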