By: Adrian (a.delete@this.acm.org), July 25, 2021 9:16 pm
Room: Moderated Discussions
dmcq (dmcq.delete@this.fano.co.uk) on July 25, 2021 5:36 pm wrote:
> Introducing the Scalable Matrix Extension for the Armv9-A Architecture
>
> I noticed ARM have issued yet another of their planned future architecture blogs, this one is called "Scalable
> Matrix Extension". Same sort of idea as what Intel is implementing as far as I can see. Except there is
> a strange aspect in that it requires a new mode for SVE called "Streaming mode" SVE. They talk about having
> the new SME instructions and a significant subset of the existing SVE2 instructions. And they say that
> one could have a longer length for the registers in streaming and non-streaming mode. As far as I can
> make out in fact only very straightforward operations are included in streaming mode.
>
> I guess that instead of being 'RISC' instructions these would have a loop dealing with widths greater
> than the hardware SVE register or memory or cache width. They've implemented something like this in
> the Cortex-M Helium extension, but I'd have thought they could just rely on OoO for the larger application
> processor. If it is so they can have larger tiles in their matrix multiplication, I'd have thought there
> would be other tricks that could do the job without a new mode. However I can't see they would have
> put in a new mode without it being very important to them. Am I missing something?
You are probably right about the necessity of looping in certain cases.
They explain clearly enough why a streaming SVE mode is needed: to present to software an apparent vector register width that is larger than the width of the ALUs.
This "streaming" mode works exactly like the traditional vector computers. For example, the Cray-1 had an apparent vector register width of 4096 bits (64 elements of 64 bits), but the vector operations were computed by a 64-bit pipelined ALU over multiple clock cycles, i.e. by "looping", as you say.
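The looping idea can be sketched in a few lines. This is purely an illustrative model — the widths below are arbitrary choices for the sketch, not taken from any real implementation: one architectural vector operation on a wide apparent register retires as several passes through a narrower physical datapath.

```python
# Illustrative model only: an "apparent" 1024-bit vector register is
# processed by a 128-bit datapath, so one architectural vector add
# executes as 8 internal passes ("looping"), invisible to software.
APPARENT_BITS = 1024   # width software sees (arbitrary for this sketch)
DATAPATH_BITS = 128    # width of the physical ALU (arbitrary for this sketch)
LANE_BITS = 64         # element size
LANES_PER_PASS = DATAPATH_BITS // LANE_BITS   # 2 lanes per pass
TOTAL_LANES = APPARENT_BITS // LANE_BITS      # 16 lanes total

def vector_add(a, b):
    """One architectural vector add, executed as multiple datapath passes."""
    out = [0] * TOTAL_LANES
    for base in range(0, TOTAL_LANES, LANES_PER_PASS):  # one pass per iteration
        for i in range(base, base + LANES_PER_PASS):
            out[i] = (a[i] + b[i]) & (2**LANE_BITS - 1)  # 64-bit wraparound
    return out

a = list(range(TOTAL_LANES))
b = [100] * TOTAL_LANES
print(vector_add(a, b)[:4])  # first four lanes: [100, 101, 102, 103]
```

Software only ever sees the full-width result; whether the hardware does it in one pass or eight is a microarchitectural choice, which is exactly what lets the streaming vector length exceed the physical ALU width.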
They also explain clearly enough why extra matrix instructions are useful. The throughput of a computational program operating on large data structures on a modern CPU is normally limited by memory throughput, usually by memory load throughput.
For many problems you can easily reach the memory throughput limit on any CPU, and nothing can be done to improve performance beyond that.
However, the problems that can be solved using matrix-matrix operations are an exception, because for large matrices the ratio between arithmetic operations and memory loads can become arbitrarily large, so increasing the number of arithmetic operations that can be done per memory load can increase performance.
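That ratio argument can be made concrete with a back-of-the-envelope count. The 2n³ and 3n² figures below are the textbook idealization for an n×n matrix multiply (each element of the two inputs and the output touched once in an ideal cache), not a claim about any specific implementation:

```python
# Back-of-the-envelope arithmetic intensity of an n x n matrix multiply:
# ~2*n**3 arithmetic ops (n**2 dot products of length n, multiply + add)
# vs ~3*n**2 matrix elements moved (read A, read B, write C once each).
# The ops-per-element ratio is 2*n/3, i.e. it grows linearly with n.
def arithmetic_intensity(n):
    flops = 2 * n**3
    elements_moved = 3 * n**2
    return flops / elements_moved

for n in (64, 512, 4096):
    print(n, arithmetic_intensity(n))
```

So unlike a streaming workload pinned at some fixed flops-per-byte, a blocked matrix multiply can in principle keep an arbitrarily large number of ALUs busy per load — provided the ISA can actually issue that many operations per load, which is where matrix instructions come in.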
The only way forward to greatly increase the number of arithmetic operations per memory load is to add matrix instructions, a.k.a. tensor instructions.
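As a sketch of how such an instruction raises the ops-per-load ratio: as I understand the SME announcement, its matrix instructions accumulate outer products into a tile, and the scalar model below (my own illustration, not ARM's code) shows why loading 2n elements can feed n² multiply-accumulates.

```python
# Outer-product formulation of matrix multiply: each step loads one
# column of A and one row of B (2*n elements) and performs n*n
# multiply-accumulates into the accumulator tile C, so every loaded
# element is reused n times.
def matmul_outer(A, B, n):
    C = [[0.0] * n for _ in range(n)]       # accumulator "tile"
    for k in range(n):                      # one outer-product step per k
        col = [A[i][k] for i in range(n)]   # 2*n loads per step...
        row = B[k]
        for i in range(n):
            for j in range(n):
                C[i][j] += col[i] * row[j]  # ...feeding n*n multiply-adds
    return C

print(matmul_outer([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2))
```

A hardware tile of accumulators turns each of those k-steps into a single instruction, which is how the arithmetic-per-load ratio is raised without raising the load bandwidth.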
The instructions added by Intel in Sapphire Rapids have very limited usefulness: due to their low precision, they can be used only for machine learning. The ARM extension appears to be much more general purpose, so it should also be useful for other applications.
Unfortunately, it is not clear how many years will pass until the introduction of ARM cores implementing this ISA extension.