By: Jan Wassenberg (jan.wassenberg.delete@this.gmail.com), May 23, 2022 10:43 pm
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on May 23, 2022 1:04 pm wrote:
> It sounds like some sort of software abstraction rather than HW with very wide registers.
Not sure why?
> I don't quite know what is "ganged together via RVV's LMUL=8"
It has the effect of repeating the instruction 8 times, generating 8 vector registers of results (contiguous in the register file).
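To make the "repeating the instruction 8 times" point concrete, here is a toy Python model of register grouping. It is only a sketch: the 512-bit VLEN, the 32-bit element width, and the helper names (`make_regfile`, `fill`, `vadd`) are assumptions for illustration, and it ignores `vl`, masking and tail policy entirely.

```python
VLEN_BITS = 512   # assumed hardware vector register width
SEW_BITS = 32     # assumed element width
NUM_VREGS = 32    # RVV has 32 architectural vector registers
LANES = VLEN_BITS // SEW_BITS


def make_regfile():
    """Return a zeroed model register file: 32 registers of LANES elements."""
    return [[0] * LANES for _ in range(NUM_VREGS)]


def fill(regs, base, count, start):
    """Write consecutive integers into `count` registers starting at `base`."""
    value = start
    for i in range(count):
        for lane in range(LANES):
            regs[base + i][lane] = value
            value += 1


def vadd(regs, vd, vs1, vs2, lmul=1):
    """Toy model of an elementwise vector add with register grouping.

    With LMUL=k, each operand names a group of k contiguous registers,
    and the group must start at a register number that is a multiple of k.
    """
    for r in (vd, vs1, vs2):
        if not 0 <= r < NUM_VREGS or r % lmul != 0:
            raise ValueError(f"v{r} is not a valid base register for LMUL={lmul}")
    for i in range(lmul):
        regs[vd + i] = [a + b for a, b in zip(regs[vs1 + i], regs[vs2 + i])]


# One instruction at LMUL=8: v24..v31 = v8..v15 + v16..v23
grouped = make_regfile()
fill(grouped, 8, 8, 0)
fill(grouped, 16, 8, 1000)
vadd(grouped, 24, 8, 16, lmul=8)

# The same work as eight separate LMUL=1 instructions
single = make_regfile()
fill(single, 8, 8, 0)
fill(single, 16, 8, 1000)
for i in range(8):
    vadd(single, 24 + i, 8 + i, 16 + i, lmul=1)

print(grouped == single)
```

In this model the architectural result is identical either way; the difference is that the LMUL=8 version is a single instruction to fetch, decode and issue.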
> and 2, 3 or 4 for path that does not go through multiplier then 1024-bit vectors (LMUL=2) are
> sufficient for effective hiding of the latency. Ganging more registers together would be harm
> (due to excessive padding and due to having too little "register names") for no gain.
I agree with your reasoning: if you need more than 32/8 = 4 "register names", then a smaller LMUL is the better choice. On the other hand, LMUL>1 is useful for machines that issue only one vector instruction per cycle, and fewer instructions could also mean lower total energy consumption.
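A rough way to see both sides of that trade-off is to tabulate, per LMUL, how many register names remain and how many instructions one elementwise pass over an array takes. This is a back-of-the-envelope sketch: the 512-bit VLEN, 32-bit elements and 4096-element array are assumed numbers, the function names are made up for illustration, and it uses only the standard relation VLMAX = LMUL * VLEN / SEW.

```python
import math

NUM_VREGS = 32  # RVV architectural vector registers


def register_names(lmul):
    """Number of distinct register groups available at a given LMUL."""
    return NUM_VREGS // lmul


def instructions_needed(n_elements, vlen_bits, sew_bits, lmul):
    """Dynamic instruction count for one elementwise op over n_elements."""
    vlmax = lmul * vlen_bits // sew_bits  # elements covered per instruction
    return math.ceil(n_elements / vlmax)


# Assumed example: VLEN=512 bits, 32-bit elements, 4096-element array
for lmul in (1, 2, 4, 8):
    names = register_names(lmul)
    count = instructions_needed(4096, 512, 32, lmul)
    print(f"LMUL={lmul}: {names} register names, {count} instructions per pass")
```

Each doubling of LMUL halves both columns, so the choice comes down to whether the kernel's live values still fit in the remaining names.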