By: Adrian (a.delete@this.acm.org), May 14, 2022 12:35 am
Room: Moderated Discussions
Matt Lohmann (mlohmann.delete@this.noemail.com) on May 13, 2022 1:32 pm wrote:
>
> What is the benefit of the AMX engine in the P core cluster given that each P
> core has 512b of NEON capability per cycle? Is the benefit of AMX in the P core
> cluster two operations per cycle versus one 512b NEON operation per cycle?
>
I do not know any details specific to Apple AMX.
Nevertheless, everybody (NVIDIA, Intel, AMD, ARM and Apple) has either already introduced, or will soon introduce, ISA extensions for matrix (a.k.a. tensor) operations alongside their traditional vector ISAs.
The ALUs used by matrix/tensor operations could also be used by vector operations, so for algorithms that are already limited by ALU throughput, switching from vector to matrix operations brings no performance improvement.
The benefit of matrix/tensor operations is that they transform algorithms previously limited by the throughput of loads from cache into algorithms limited by ALU throughput.
This is because matrix/tensor operations use more internal registers, which are filled with rows or columns of data that are then reused many times during a single matrix/tensor operation. This avoids reloading the data from cache, avoids fetching and decoding many separate instructions, and avoids the excessively wide register fields that the instruction encoding would need if a very large number of scalar or short-vector registers were used to hold a matrix instead.
So for many algorithms, matrix/tensor operations can greatly improve the ratio between ALU operations and loads/stores from cache, resulting in higher throughput, and this is their advantage.
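
To make the register-reuse argument concrete, here is a minimal C sketch (purely illustrative, not Apple AMX code; the 4x4 tile size, the function name and the assumption that n is a multiple of 4 are my own choices). Keeping a 4x4 accumulator tile in registers means each step of the k loop loads 4 + 4 = 8 values but performs 16 multiply-adds, whereas a naive inner-product loop loads 2 values per multiply-add:

    #include <stddef.h>

    /* C += A * B for n x n row-major matrices, n a multiple of 4.
       The 4x4 accumulator tile stays in registers across the whole
       k loop, so each A/B element loaded is reused 4 times. */
    void matmul_4x4_tile(const float *A, const float *B, float *C, size_t n)
    {
        for (size_t i = 0; i < n; i += 4) {
            for (size_t j = 0; j < n; j += 4) {
                float acc[4][4] = {{0}};  /* accumulator tile, held in registers */
                for (size_t k = 0; k < n; k++) {
                    float a[4], b[4];
                    for (int r = 0; r < 4; r++)  /* 4 loads: a column slice of A */
                        a[r] = A[(i + r) * n + k];
                    for (int c = 0; c < 4; c++)  /* 4 loads: a row slice of B */
                        b[c] = B[k * n + j + c];
                    for (int r = 0; r < 4; r++)  /* 16 multiply-adds from 8 loads */
                        for (int c = 0; c < 4; c++)
                            acc[r][c] += a[r] * b[c];
                }
                for (int r = 0; r < 4; r++)      /* write the finished tile back */
                    for (int c = 0; c < 4; c++)
                        C[(i + r) * n + j + c] += acc[r][c];
            }
        }
    }

With a t x t tile the ratio grows to t*t multiply-adds per 2*t loads, i.e. t/2 MACs per load, which is exactly the kind of improvement dedicated matrix registers make possible at tile sizes where a general-purpose or vector register file would run out of space.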