ARM Scalable Matrix Extension

Michael S on July 26, 2021 2:53 am wrote:
> If by "tensor instructions" you mean "outer product" (I think, nowadays
> that's the most common meaning) then I tend to disagree.
> IMHO, many tasks can be accelerated quite nicely by "inner product" instructions. And intuitively,
> without being a chip architect, I would think that "inner product" is inherently more energy
> efficient than "outer product", due to lower fan out to register files.

If you are thinking of a traditional general-register to general-register architecture, then fan-out probably would be a problem. But that is not the case here: the number of possible output registers is very limited, and they may not even be renamed, which should limit the fan-out considerably. In fact, the ZA storage should not need to be much larger than the latches that already exist for pipelining the multipliers.
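
To make the inner vs. outer product distinction concrete, here is a rough sketch in plain Python. It is only an illustration of the data flow, not SME code: the function names and the `tile` variable are made up for this example, and `tile` merely stands in for a fixed accumulator like a ZA tile. The point is that the outer-product form always writes to the same small accumulator, so the destination is fixed rather than chosen from a large general register file.

```python
def matmul_outer(a, b):
    """Compute A @ B as a sum of rank-1 (outer product) updates.

    'tile' is a hypothetical stand-in for a fixed accumulator tile;
    every step reads m + n inputs and performs m * n multiply-accumulates
    into that same tile.
    """
    m, k, n = len(a), len(b), len(b[0])
    tile = [[0] * n for _ in range(m)]
    for p in range(k):
        col = [a[i][p] for i in range(m)]  # column p of A: m values
        row = b[p]                          # row p of B: n values
        for i in range(m):
            for j in range(n):
                tile[i][j] += col[i] * row[j]
    return tile


def matmul_inner(a, b):
    """Compute A @ B one output element at a time via dot products.

    Each output element needs 2 * k inputs and produces a single result.
    """
    m, k, n = len(a), len(b), len(b[0])
    return [
        [sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(matmul_outer(a, b))
    print(matmul_inner(a, b))
```

Both functions give the same result; what differs is the ratio of inputs read to multiply-accumulates performed per step, and where the results land.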
