By: Wilco (wilco.dijkstra.delete@this.ntlworld.com), November 3, 2019 12:09 pm
Room: Moderated Discussions
Ronald Maas (ronaldjmaas.delete@this.gmail.com) on November 3, 2019 7:58 am wrote:
> Wilco (wilco.dijkstra.delete@this.ntlworld.com) on November 2, 2019 11:20 am wrote:
> > anon (spam.delete.delete@this.this.spam.com) on November 2, 2019 10:49 am wrote:
> > > Adrian (a.delete@this.acm.org) on November 2, 2019 10:33 am wrote:
> > > > Ronald Maas (ronaldjmaas.delete@this.gmail.com) on November 2, 2019 10:02 am wrote:
> > > > > Heikki Kultala (heikki.kultala.delete@this.tuni.fi) on November 2, 2019 2:31 am wrote:
> > > > > > And then ARMv8 has smart non-riscy things like:
> > > > > >
> > > > > > * More advanced addressing modes
> > > > > > * Paired loads and stores
> > > > > >
> > > > > > Which can save a huge number of instructions.
> > > > > >
> > > > >
> > > > > Someone from the RISC-V team made a video comparing various ISAs
> > > > >
> > > > > The diagram on the 18:30 mark shows RV64G requires 10% more instructions compared to ARMv8.
> > > > >
> > > > > Not a huge difference in my opinion.
> > > > >
> > > > > Ronald
> > > > >
> > > >
> > > >
> > > >
> > > > That might be true for whole programs, but for critical loops it is not uncommon for RISC-V to need
> > > > twice as many instructions compared to better instruction sets, e.g. ARMv8, x86 or POWER.
> > > >
> > > > The answer to that of the RISC-V architects is that any high-performance
> > > > implementation of RISC-V must do instruction-pair fusion.
> > > >
> > > >
> > > > I refuse to believe that instruction-pair fusion is simpler or better than the
> > > > trivial enhancement of the instruction encoding to cover the more complex instructions that
> > > > are needed in almost all loops, e.g. either with indexed or auto-indexed addressing.
> > > >
> > > >
> > >
> > > Their answer for everything is "just solve it in hardware because on high performance hardware you
> > > should be able to afford it anyway". They used the same argument
> > > for predication/cmov. On low end implementations
> > > the cost for implementing them would be better spent on branch prediction and on high end implementations
> > > the branch predictor should either be good enough to just predict the branch correctly anyway or simply
> > > detect when predication would make sense and dynamically transform those instructions into predicated
> > > uops as needed with the branch result as predicate. Super simple, right?
> >
> > Yes I don't buy the fusion story either. It's not exactly trivial to fuse the 3 instructions
> > needed for a load with shifted index. Codesize will be better by adding encodings for
> > real instructions. And all but the highest-end implementations won't do any fusion,
> > while they would benefit the most from having powerful instructions.
> >
> > It seems like they are still stuck in 80's RISC dogma. That video for example incorrectly assumes
> > micro-ops on AArch64 only ever write one register. Modern Arm cores process 4 registers per cycle
> > in a load-multiple or 2 load-pairs per cycle... It makes sense to design a new ISA to match expected
> > capabilities of future implementations, not religiously apply an ancient 2R/1W rule.
> >
> > Wilco
>
> On the very low end having a very basic ISA like RISC-V can be a real difference maker
> compared to other ISAs, because it saves on die space where it counts.
The size of a core is already insignificant at the low end. Flash, SRAM and peripherals take most of the die space of a microcontroller. A Cortex-M23 core takes 0.0037 mm^2 on 28HPC despite supporting multiply, division, load/store multiple etc.
Reducing the size of a core to the absolute bare minimum implies executing extra instructions, and thus a higher frequency to get the same performance, plus extra power to fetch those instructions and maybe even a larger flash if your code no longer fits. You have to look beyond the decoder and consider the whole system.
> On high end implementations, the number of transistors needed for branch prediction, out of order
> execution, uncore, etc. will dwarf what is needed to implement instruction fusion.
Instruction fusion adds a non-zero cost even in a high-end implementation (for example extra decode cycles and thus higher mispredict latency), so an implementation of an ISA which doesn't need such fusion would still be faster.
> So in my opinion RISC-V designers made the right call here.
By only allowing the very high-end to get any benefit from fusion? No other implementation could benefit from indexing, load multiple or even zero/sign-extend?
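To make the indexing and zero-extend point concrete, here is a rough sketch in C. The function names are made up for illustration, and the instruction sequences in the comments are what a compiler would typically emit for AArch64 and for base RV64G as it stood at the time of writing; exact output depends on the compiler and options.

```c
#include <stdint.h>

/* Load with a shifted 64-bit index: returns base[i].
 *
 * Typical AArch64 code, one instruction for the load:
 *     ldr  x0, [x0, x1, lsl #3]
 *
 * Typical base RV64G code, three instructions for the load:
 *     slli a1, a1, 3
 *     add  a0, a0, a1
 *     ld   a0, 0(a0)
 */
int64_t load_indexed(const int64_t *base, int64_t i)
{
    return base[i];
}

/* Same load, but the index is an unsigned 32-bit value that has to be
 * zero-extended to 64 bits before it is scaled and added.
 *
 * Typical AArch64 code, extend and shift folded into the load:
 *     ldr  x0, [x0, w1, uxtw #3]
 *
 * Typical base RV64G code, four instructions:
 *     slli a1, a1, 32
 *     srli a1, a1, 29
 *     add  a0, a0, a1
 *     ld   a0, 0(a0)
 */
int64_t load_indexed_u32(const int64_t *base, uint32_t i)
{
    return base[i];
}
```

A core without fusion executes every one of those extra instructions, which is exactly the class of implementation that gets nothing from the fusion argument.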
Wilco