By: anon (spam.delete.delete@this.this.spam.com), November 2, 2019 1:57 pm
Room: Moderated Discussions
Wilco (wilco.dijkstra.delete@this.ntlworld.com) on November 2, 2019 12:20 pm wrote:
> anon (spam.delete.delete@this.this.spam.com) on November 2, 2019 10:49 am wrote:
> > Adrian (a.delete@this.acm.org) on November 2, 2019 10:33 am wrote:
> > > Ronald Maas (ronaldjmaas.delete@this.gmail.com) on November 2, 2019 10:02 am wrote:
> > > > Heikki Kultala (heikki.kultala.delete@this.tuni.fi) on November 2, 2019 2:31 am wrote:
> > > > > And then ARMv8 has smart non-riscy things like:
> > > > >
> > > > > * More advanced addressing modes
> > > > > * Paired loads and stores
> > > > >
> > > > > Which can save huge amount of instructions.
> > > > >
> > > >
> > > > Someone from the RISC-V team made a video comparing various ISAs
> > > >
> > > > The diagram on the 18:30 mark shows RV64G requires 10% more instructions compared to ARMv8.
> > > >
> > > > Not a huge difference in my opinion.
> > > >
> > > > Ronald
> > > >
> > >
> > >
> > >
> > > That might be true for whole programs, but for critical loops it is not uncommon for RISC-V to need
> > > a double number of instructions compared to better instruction sets, e.g. ARMv8, x86 or POWER.
> > >
> > > The answer to that of the RISC-V architects is that any high-performance
> > > implementation of RISC-V must do instruction-pair fusion.
> > >
> > >
> > > I refuse to believe that instruction-pair fusion is simpler or better than the
> > > trivial enhancement of the instruction encoding to cover the more complex instructions that
> > > are needed in almost all loops, e.g. either with indexed or auto-indexed addressing.
> > >
> > >
> >
> > Their answer for everything is "just solve it in hardware because on high performance hardware you
> > should be able to afford it anyway". They used the same argument
> > for predication/cmov. On low end implementations
> > the cost for implementing them would be better spent on branch prediction and on high end implementations
> > the branch predictor should either be good enough to just predict the branch correctly anyway or simply
> > detect when predication would make sense and dynamically transform those instructions into predicated
> > uops as needed with the branch result as predicate. Super simple, right?
>
> Yes I don't buy the fusion story either. It's not exactly trivial to fuse the 3 instructions
> needed for a load with shifted index. Codesize will be better by adding encodings for
> real instructions. And all but the highest-end implementations won't do any fusion,
> while they would benefit the most from having powerful instructions.
>
> It seems like they are still stuck in 80's RISC dogma. That video for example incorrectly assumes
> micro-ops on AArch64 only ever write one register. Modern Arm cores process 4 registers per cycle
> in a load-multiple or 2 load-pairs per cycle... It makes sense to design a new ISA to match expected
> capabilities of future implementations, not religiously apply an ancient 2R/1W rule.
>
> Wilco
There's also this overly simplistic rule that everything is either so extremely low end that implementing any addressing modes would have a noticeable cost and implementing cmov would cut into the transistor budget for the branch predictor, or so high end that you can afford absolutely everything, even stuff that hasn't been implemented outside of academic papers yet.
Given that pretty much everything beyond the absolute bare bones (even mul/div) is an extension in RISC-V, and that they're doing an even further cut-down embedded version, would it really have killed them to leave room for more addressing modes, cmov/predication and other things? If they're optional you can't complain about the cost. I believe there exists something between the low and the high end, where a 3R/1W renamer is not outrageously expensive, but a metric shit ton of circuitry to detect and use all sorts of fusion and predication opportunities still is.
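
To make the two cases concrete, here is a rough sketch in C of the kind of code being argued about. The function names (load_indexed, select_i64) are made up for illustration, and the instruction sequences in the comments are what I'd expect from the base ISAs, not output from any particular compiler. What you actually get depends on compiler, flags and surrounding code.

```c
#include <stddef.h>
#include <stdint.h>

/* Load with shifted index: address = base + (i << 3) for 8-byte elements.
 *   AArch64:    ldr  x0, [x0, x1, lsl #3]        (1 instruction)
 *   RV64I base: slli a1, a1, 3
 *               add  a0, a0, a1
 *               ld   a0, 0(a0)                   (3 instructions)
 * A RISC-V core would have to fuse all three to match the single load. */
int64_t load_indexed(const int64_t *base, size_t i) {
    return base[i];
}

/* Conditional select with no branch in the source.
 *   AArch64:    cmp + csel
 *   RV64I base: no conditional move, so this is either a branch or a
 *               mask sequence like the one written out below. */
int64_t select_i64(int cond, int64_t a, int64_t b) {
    int64_t mask = -(int64_t)(cond != 0); /* all ones if cond, else zero */
    return (a & mask) | (b & ~mask);
}
```

Both functions read three registers' worth of inputs at most (base, index, or cond plus two values) and write one result, which is the 3R/1W case mentioned above.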