By: Michael S (already5chosen.delete@this.yahoo.com), May 23, 2022 1:04 pm
Room: Moderated Discussions
Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on May 23, 2022 5:49 am wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on May 23, 2022 2:32 am wrote:
> > Shipping RISC-V hardware with vectors? Or experimental boards?
> Here's mention of a supercomputer with 16384 bits: https://github.com/riscv/riscv-v-spec/issues/367
It sounds like some sort of software abstraction rather than HW with very wide registers.
> Sifive X280 (https://www.sifive.com/cores/intelligence-x280) has 512-bit vectors
> which when ganged together via RVV's LMUL=8, result in 4096 bit vectors.
I don't quite know what "ganged together via RVV's LMUL=8" means, but if it is what I guess it is,
then it sounds like a bad idea.
The underlying EU is 1x256-bit. That suggests that if the latency of their FPU is in line with what is considered normal today for relatively slowly clocked FPUs, i.e. 4 clocks for the FMAC path through the multiplier and 2, 3 or 4 clocks for paths that do not go through the multiplier, then 1024-bit vectors (LMUL=2) are sufficient to hide the latency effectively: a 1024-bit operation occupies a 256-bit EU for 4 clocks, which matches the FMAC latency. Ganging more registers together would do harm (due to excessive padding and due to having too few "register names") for no gain.
If you have a board with an X280, it would be easy for you to test whether I am right on some typical kernels, e.g. SGEMM.
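To make the LMUL=2 estimate above concrete, here is a minimal back-of-the-envelope sketch. It assumes the numbers from this thread (512-bit registers, a single 256-bit EU, 4-clock FMAC latency) and a deliberately simple model in which a vector instruction occupies the EU for (register group bits / datapath bits) clocks, so a dependent instruction can follow without a stall once that occupancy reaches the latency. The function names are made up for illustration and the model ignores issue restrictions, memory, and chaining details of any real core.

```python
import math

NUM_VECTOR_REGS = 32  # RVV architecturally defines 32 vector registers


def occupancy_cycles(vlen_bits, lmul, datapath_bits):
    """Clocks one vector instruction keeps the EU busy (simplified model)."""
    return math.ceil(vlen_bits * lmul / datapath_bits)


def min_lmul_to_hide_latency(vlen_bits, datapath_bits, latency_cycles):
    """Smallest integer LMUL (1, 2, 4, 8) whose occupancy covers the latency."""
    for lmul in (1, 2, 4, 8):
        if occupancy_cycles(vlen_bits, lmul, datapath_bits) >= latency_cycles:
            return lmul
    return None  # even LMUL=8 does not cover the latency


def register_groups(lmul):
    """Number of independently nameable register groups at a given LMUL."""
    return NUM_VECTOR_REGS // lmul


if __name__ == "__main__":
    VLEN = 512        # assumed: X280 register width quoted in this thread
    DATAPATH = 256    # assumed: 1x256-bit EU, as stated above
    FMAC_LATENCY = 4  # assumed: typical FMAC latency, as stated above

    print("LMUL  group_bits  EU_clocks  register_groups")
    for lmul in (1, 2, 4, 8):
        print(lmul, VLEN * lmul,
              occupancy_cycles(VLEN, lmul, DATAPATH),
              register_groups(lmul))
    print("smallest LMUL covering the latency:",
          min_lmul_to_hide_latency(VLEN, DATAPATH, FMAC_LATENCY))
```

Under these assumptions LMUL=2 already keeps the EU busy for 4 clocks, while LMUL=8 leaves only 4 register groups to name, which is the "too few register names" concern. Whether real hardware behaves this way is exactly what an SGEMM run would show.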