By: Wilco (wilco.dijkstra.delete@this.ntlworld.com), November 4, 2019 4:01 pm
Room: Moderated Discussions
Ronald Maas (ronaldjmaas.delete@this.gmail.com) on November 3, 2019 9:44 pm wrote:
> Wilco (wilco.dijkstra.delete@this.ntlworld.com) on November 3, 2019 11:09 am wrote:
> > The size of a core is already insignificant at the low end. Flash, SRAM and peripherals
> > take most of the die space of a microcontroller. A Cortex-M23 core takes 0.0037 mm^2
> > on 28HPC despite supporting multiply, division, load/store multiple etc.
> >
> Footprint of a core is not so insignificant when using older process technologies or FPGA soft cores.
Consider a die shot of a low-end 180nm Cortex-M3 microcontroller: the core takes less than 4% of the die (a Cortex-M23 would be about half that). Shaving another 10-20% off the core, which is well under 1% of the total die area, doesn't help the bottom line, since it lowers performance and increases power consumption, making the device less competitive.
> > Reducing the size of a core to the absolute bare minimum
> > implies executing extra instructions, and thus a higher
> > frequency to get the same performance, plus extra power to fetch those instructions and maybe even a larger
> > flash if your code no longer fits. You have to look beyond the decoder and consider the whole system.
> >
> > > On high end implementations, the number of transistors needed for branch prediction, out of order
> > > execution, uncore, etc. will dwarf for what is needed to implement instruction fusion.
> >
> > Instruction fusion adds a non-zero cost even in a high-end implementation (for
> > example extra decode cycles and thus higher mispredict latency), so an implementation
> > of an ISA which doesn't need such fusion would still be faster.
> >
> It has been a while since I watched that video. But if I remember correctly, fusion works
> by treating two 16-bit instructions as a single 32-bit instruction. The decode will be more
> complex but this approach would not necessarily require any additional decode cycles.
That sounds like a bad idea, since you now have to examine many more bits before you know how to decode an instruction. E.g. fusing add+load into an indexed load requires checking that the load's immediate offset is zero and that the registers are consistent between the add and the load. That's significantly more complex than simply using a major opcode for indexed loads and stores.
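To make that concrete, here is a rough C sketch of the cross-instruction conditions a fusing decoder would have to evaluate before treating an add+load pair as one indexed load. The struct fields, function name, and the exact fusion idiom shown are purely illustrative assumptions, not taken from any real decoder or the RISC-V spec.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded-instruction record; the field names are
   illustrative only, not from any real decoder. */
typedef struct {
    enum { OP_ADD, OP_LOAD, OP_OTHER } op;
    unsigned rd, rs1, rs2;   /* destination and source registers */
    int32_t  imm;            /* immediate (load offset)          */
} insn_t;

/* Cross-instruction checks needed before "add rd, rs1, rs2" followed by
   "load rd, imm(rd)" may be fused into a single indexed load. With a
   dedicated indexed-load opcode none of this is needed: the opcode
   field alone identifies the operation. */
static bool can_fuse_add_load(const insn_t *a, const insn_t *b)
{
    return a->op == OP_ADD &&
           b->op == OP_LOAD &&
           b->imm == 0 &&        /* offset must be zero               */
           b->rs1 == a->rd &&    /* load base is the add result       */
           b->rd  == a->rd;      /* add result is dead after the pair */
}

Each fusion pattern adds checks like these across instruction boundaries, which is exactly the extra decode work that a dedicated opcode avoids.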
> > > So in my opinion RISC-V designers made the right call here.
> >
> > By only allowing the very high-end to get any benefit from fusion? No other implementation
> > could benefit from indexing, load multiple or even zero/sign-extend?
> >
> There are advantages to not having these instructions, such as having a large encoding space available
> for custom instructions, allowing compressed instructions for both 32-bit and 64-bit,
> enabling students to design a CPU implementation in one or two semesters, etc.
Let me get this straight - a half-finished ISA is an advantage since it makes it easier for everybody to try to fix it up with their own incompatible custom instructions?!?
Wilco