Matrix Math Accelerator

By: Michael S, August 17, 2020 1:32 am
Adrian on August 17, 2020 1:01 am wrote:
> Crystal S. Diamond on August 16, 2020 10:20 pm wrote:
> > Here it is boys...
> >
> >
> So IBM also jumps on the bandwagon of adding ISA extensions for matrix
> computations, like NVIDIA, Intel Sapphire Rapids, future ARM etc.

> New Processor Core Architectures in the IBM POWER10 processor with an embedded Matrix
> Math Accelerator which is extrapolated to provide 10x, 15x and 20x faster AI inference
> for FP32, BFloat16 and INT8 calculations per socket respectively than the IBM POWER9
> processor to infuse AI into business applications and drive greater insights.


IBM appears to do it on FP32 inputs. Is that the case for the rest of them?
A 4x4 FP32 matmul would be useful not just for deep learning.
The problem is that 2048 bits per core per cycle is not very impressive for such a massive core, which is closer to 2-4 cores from other manufacturers.
A64FX and Cascade Lake already do 1024 bits per core without special instructions.
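
To put those widths in FLOPS terms, here is a rough sketch of the arithmetic. It assumes that "bits per cycle" means the total width of FP32 FMA results per core per cycle and that one FMA counts as 2 FLOPs; the function names are made up for illustration.

```python
FP32_BITS = 32  # width of one single-precision element


def fp32_lanes(bits_per_cycle):
    """Number of FP32 results that fit in the given width (assumed one FMA each)."""
    return bits_per_cycle // FP32_BITS


def flops_per_cycle(bits_per_cycle):
    """Peak FP32 FLOPs per cycle, counting one FMA as a multiply plus an add."""
    return 2 * fp32_lanes(bits_per_cycle)


# Figures quoted in this post; the labels are only for printing.
for label, bits in [("2048 bits/cycle", 2048), ("1024 bits/cycle", 1024)]:
    print(label, "->", fp32_lanes(bits), "FP32 FMAs,", flops_per_cycle(bits), "FLOPs per cycle")
```

So, under those assumptions, the gap per core is a factor of 2, before accounting for how much bigger the core is.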

I wonder why they expose their engine as an outer product.
Of course, an outer product is very generic, and it likely fits well in the pipeline because its latency is comparable to "normal" FP operations, but it seems to me that it makes good FLOPS gains impossible because of the RF write bottleneck.
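
For reference, a minimal sketch of what I mean by an outer-product step, in plain Python. This is not IBM's actual instruction semantics: the function names are hypothetical, and where the accumulator physically lives is not something I know. It only shows that a 4x4 matmul is a sum of four rank-1 updates, and that each update reads 256 bits of inputs while updating 512 bits of results.

```python
N = 4  # tile dimension assumed from the 4x4 FP32 case discussed above


def outer_product_accumulate(acc, x, y):
    """Rank-1 update: acc[i][j] += x[i] * y[j] for an N x N accumulator."""
    for i in range(N):
        for j in range(N):
            acc[i][j] += x[i] * y[j]
    return acc


def matmul_4x4(a, b):
    """4x4 matmul as a sum of N outer products (column k of a times row k of b)."""
    acc = [[0.0] * N for _ in range(N)]
    for k in range(N):
        col_k = [a[i][k] for i in range(N)]
        row_k = b[k]
        outer_product_accumulate(acc, col_k, row_k)
    return acc


def bits_per_update(n=N, elem_bits=32):
    """(input bits read, accumulator bits updated) for one outer-product step."""
    return 2 * n * elem_bits, n * n * elem_bits


identity = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
m = [[float(4 * i + j) for j in range(N)] for i in range(N)]
print(matmul_4x4(identity, m) == m)
print(bits_per_update())
```

Each step produces twice as many result bits as it consumes in inputs, which is the write-side pressure I am worried about if the results have to go through the ordinary register file.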
