By: Wilco (wilco.dijkstra.delete@this.ntlworld.com), November 2, 2019 12:20 pm
Room: Moderated Discussions
anon (spam.delete.delete@this.this.spam.com) on November 2, 2019 10:49 am wrote:
> Adrian (a.delete@this.acm.org) on November 2, 2019 10:33 am wrote:
> > Ronald Maas (ronaldjmaas.delete@this.gmail.com) on November 2, 2019 10:02 am wrote:
> > > Heikki Kultala (heikki.kultala.delete@this.tuni.fi) on November 2, 2019 2:31 am wrote:
> > > > And then ARMv8 has smart non-riscy things like:
> > > >
> > > > * More advanced addressing modes
> > > > * Paired loads and stores
> > > >
> > > > Which can save huge amount of instructions.
> > > >
> > >
> > > Someone from the RISC-V team made a video comparing various ISAs
> > >
> > > The diagram on the 18:30 mark shows RV64G requires 10% more instructions compared to ARMv8.
> > >
> > > Not a huge difference in my opinion.
> > >
> > > Ronald
> > >
> >
> >
> >
> > That might be true for whole programs, but for critical loops it is not uncommon for RISC-V to need
> > a double number of instructions compared to better instruction sets, e.g. ARMv8, x86 or POWER.
> >
> > The answer to that of the RISC-V architects is that any high-performance
> > implementation of RISC-V must do instruction-pair fusion.
> >
> >
> > I refuse to believe that instruction-pair fusion is simpler or better than the
> > trivial enhancement of the instruction encoding to cover the more complex instructions that
> > are needed in almost all loops, e.g. either with indexed or auto-indexed addressing.
> >
> >
>
> Their answer for everything is "just solve it in hardware because on high performance hardware you
> should be able to afford it anyway". They used the same argument for predication/cmov. On low end implementations
> the cost for implementing them would be better spent on branch prediction and on high end implementations
> the branch predictor should either be good enough to just predict the branch correctly anyway or simply
> detect when predication would make sense and dynamically transform those instructions into predicated
> uops as needed with the branch result as predicate. Super simple, right?
Yes, I don't buy the fusion story either. It's not exactly trivial to fuse the 3 instructions needed for a load with shifted index. Code size will be better with real instruction encodings. And all but the highest-end implementations won't do any fusion, even though they would benefit the most from having powerful instructions.
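To make the "load with shifted index" case concrete, here is a minimal sketch. The function name load_indexed is just an illustration, and the assembly in the comment is what a compiler would typically emit for base RV64G (no address-generation extensions) versus AArch64; exact register allocation will vary.

```c
#include <stdint.h>
#include <stddef.h>

/* Load a[i] from an array of 64-bit elements: address = a + (i << 3).
 *
 * AArch64: one instruction, using the shifted register-offset mode:
 *     ldr  x0, [x0, x1, lsl #3]
 *
 * Base RV64G: three dependent instructions that a core would have
 * to fuse to match the above:
 *     slli a1, a1, 3
 *     add  a0, a0, a1
 *     ld   a0, 0(a0)
 */
int64_t load_indexed(const int64_t *a, size_t i)
{
    return a[i];
}
```

In a loop that walks an array by index, those two extra instructions are paid on every access unless the compiler can strength-reduce the index into a pointer increment.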
It seems like they are still stuck in '80s RISC dogma. That video, for example, incorrectly assumes micro-ops on AArch64 only ever write one register. Modern Arm cores process 4 registers per cycle in a load-multiple or 2 load-pairs per cycle... It makes sense to design a new ISA to match the expected capabilities of future implementations, not to religiously apply an ancient 2R/1W rule.
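As a rough illustration of an instruction that writes two registers, consider loading both fields of a small struct. The names pair and sum_pair are made up for this sketch, and the assembly in the comment is the typical shape of compiler output, not a guarantee.

```c
#include <stdint.h>

/* Two adjacent 64-bit fields, 16 bytes in total. */
struct pair {
    int64_t lo;
    int64_t hi;
};

/* AArch64: a single load-pair writes two destination registers:
 *     ldp  x1, x2, [x0]
 *     add  x0, x1, x2
 *
 * RV64G: each load writes exactly one register, so two are needed:
 *     ld   a1, 0(a0)
 *     ld   a0, 8(a0)
 *     add  a0, a0, a1
 */
int64_t sum_pair(const struct pair *p)
{
    return p->lo + p->hi;
}
```

The same pattern shows up in function prologues and epilogues, where saving and restoring registers two at a time halves the number of load/store instructions.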
Wilco