By: Brett (ggtgp.delete@this.yahoo.com), June 7, 2022 6:04 pm
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on June 7, 2022 5:20 am wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on June 7, 2022 2:23 am wrote:
> > Adrian (a.delete@this.acm.org) on June 7, 2022 1:21 am wrote:
> > >
> > > "But someone could also have codes for common instruction sequences": yes, I agree.
> > >
> > > I believe that this is by far the most practical method of increasing the code
> > > density, i.e. to add complex instructions to the ISA, but only if they are well
> > > chosen, based on usage frequency, to be able to influence the code density.
> > >
> > > While in general I have an extremely poor opinion of the RISC V ISA, which
> > > I believe to be one of the worst of the more than 100 ISA with which I am
> > > familiar, the RISC V ISA nonetheless includes a few very good features.
> > >
> > > By far the best feature of RISC V are the combined compare-and-branch instructions, even
> > > if RISC V does not have all the comparison cases that would be needed in a complete ISA.
> >
> > Nios2 does have all reg-to-reg signed/unsigned comparison cases combined with branch.
> > According to my code size measurements, the improvement over RV is not noticeable.
>
> I did not understand this sentence. If both Nios2 and RISC-V have combined
> compare-and-branch, why would Nios2 be expected to be better?
>
> By RISC-V not having all the required conditional branches I have not referred to the 6 compare-and-branch
> instructions needed for the simple relations between signed or unsigned integers, which are included
> in the RISC-V ISA, but to other strictly necessary conditional branches that are missing, like
> testing for integer overflow, and to some other nice to have extra combined conditional branches,
> like test-under-mask-and-branch and some tests useful for loop termination.
> > >
> > > Because of the very high frequency of the conditional branches and because for almost
> > > every such branch RISC-V saves a 32-bit word in comparison with AArch64, this allows the
> > > length of many RISC-V programs to be competitive with that for AArch64, even if RISC-V
> > > needs a lot of extra 32-bit words for a large part of the load/store instructions.
> > >
> > > In my opinion, the easiest way to increase the code density of AArch64 would be to define a set
> > > of compare-and-branch/test-under-mask-and-branch/branch-on-count/branch-on-index instructions.
> > >
> > > I have verified that there is enough free encoding space in the AArch64 branch instruction
> > > block, to allow the encoding of all kinds of such instructions that would be needed.
> > >
> > > Conditional branches are usually 15% to 20% of all instructions and saving one 32-bit word for most
> > > of them would cause a larger improvement in code density than almost any other encoding change.
> >
> > My guess is that very significant part of compare|test instructions ahead of branch are of
> > 'with immediate' variety. So, I expect much smaller gain than suggested by your analysis.
> > In specific case of aarch64, another, probably even bigger, factor that reduces
> > the potential gain is the fact ISA already has CBZ/CBNZ/TBZ/TBNZ.
>
> CBZ/CBNZ is not a compare-and-branch, despite the name. There
> are only few cases when it can reduce the code size.
>
> TBZ/TBNZ also reduces the code size only when it is enough to test a single bit, not a bit field.
>
> Both instructions are useful, but, at least in my experience, they can be used
> only much more seldom than compare-and-branch and test-under-mask-and-branch.
>
> > Generally, I disagree with your conclusion.
> > 2 instruction sizes (16b/32b) is bigger density win than
> > what is possible with very smart choice of combined
> > sequences. 3 instruction sizes (16b/32b/48b or 16b/32b/40b, I'm not clear on which one is better) is
> > better yet, but by smaller increment over 2 sizes. The
> > biggest win of 3 sizes is not so much a code density,
> > but potentially better performance of narrow (1 or 2-wide) implementations of the ISA.
>
> My conclusion was valid only while keeping the constraint of a fixed-length encoding.
>
> I completely agree with you that a variable-length encoding using 16-bit multiples is certain to achieve a
> greater code density than any fixed-length encoding, even if the latter encodes some complex instructions.
>
> I also agree about the advantage for small implementations.
>
> However, for both purposes, ARM has Armv8-M and I do not think that there is any
> need in the near future to have an AArch64 version for such applications.
>
> On the other hand, improving the code density of AArch64 with minimal changes is useful. The
> latest AArch64 implementations already have fusion for the compare and branch instruction pairs,
> so encoding them in a single word would not change anything else, except the decoder.
Add/sub from memory would give the biggest code size reduction of any combined instruction, and x86 has it.
I don't know the current compiler policy on generating such code for x86, or what restrictions apply.
On x86 it tends to be a long instruction competing with a pair of short instructions, so it is not much of a win there. A clean-sheet design would improve this.
Of course, supporting add from memory costs die area and engineering time.
Adding a small belt is likely better.
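
For a rough feel of the numbers argued over in the quoted text, here is a back-of-envelope sketch. It assumes a fixed 32-bit encoding, that each fused conditional branch removes exactly one preceding compare/test word, and that the 15% to 20% branch share is measured against the original instruction count. The function name and the `fusable_fraction` parameter are made up for illustration; that parameter stands in for Michael S's objection that many compares use an immediate or are already covered by CBZ/CBNZ/TBZ/TBNZ, and nothing in the thread gives a measured value for it.

```python
def fused_size_ratio(branch_fraction: float, fusable_fraction: float) -> float:
    """Return new_size / old_size after merging compare+branch pairs.

    branch_fraction:  share of all instructions that are conditional
                      branches (the thread quotes 0.15 to 0.20).
    fusable_fraction: hypothetical share of those branches whose compare
                      could be folded into the branch word (an assumption,
                      not a measured figure).
    """
    if not (0.0 <= branch_fraction <= 1.0 and 0.0 <= fusable_fraction <= 1.0):
        raise ValueError("fractions must lie in [0, 1]")
    # Each fused branch drops one instruction word from the program.
    return 1.0 - branch_fraction * fusable_fraction


if __name__ == "__main__":
    for branches in (0.15, 0.20):
        for fusable in (0.25, 0.50, 1.00):
            ratio = fused_size_ratio(branches, fusable)
            print(f"branches={branches:.2f} fusable={fusable:.2f} "
                  f"size ratio={ratio:.3f}")
```

Under these assumptions the saving is simply the product of the two fractions, so the disagreement in the quoted text comes down to how large the fusable share really is.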