By: Adrian (a.delete@this.acm.org), July 25, 2021 10:32 pm
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on July 25, 2021 9:16 pm wrote:
> dmcq (dmcq.delete@this.fano.co.uk) on July 25, 2021 5:36 pm wrote:
> > Introducing the Scalable Matrix Extension for the Armv9-A Architecture
> >
> > I noticed ARM have issued yet another of their planned future
> > architecture blogs, this one is called "Scalable
> > Matrix Extension". Same sort of idea as what Intel is implementing as far as I can see. Except there is
> > a strange aspect in that it requires a new mode for SVE called "Streaming mode" SVE. They talk about having
> > the new SME instructions and a significant subset of the existing SVE2 instructions. And they say that
> > one could have a longer length for the registers in streaming than in non-streaming mode. As far as I can
> > make out, in fact only very straightforward operations are included in streaming mode.
> >
> > I guess that instead of being 'RISC' instructions these would have a loop dealing with widths greater
> > than the hardware SVE register or memory or cache width. They've implemented something like this in
> > the Cortex-M Helium extension, but I'd have thought they could just rely on OoO for the larger application
> > processors. If it is so that they can have larger tiles in their matrix multiplication, I'd have thought there
> > would be other tricks that could do the job without a new mode. However I can't see they would have
> > put in a new mode without it being very important to them. Am I missing something?
>
>
>
> You are probably right about the necessity of looping in certain cases.
>
> They explain clearly enough why a streaming SVE mode is needed: to present to
> software an apparent vector register width that is larger than the width of the ALUs.
>
> This "streaming" mode is in fact exactly how traditional vector computers operated. For
> example, the Cray-1 had an apparent vector register width of 1024 bits, but the vector operations were
> computed by a 64-bit pipelined ALU over multiple clock cycles, i.e. "looping", as you say.
>
>
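To make the looping idea concrete, here is a minimal sketch (illustrative only, not a model of any real hardware) of a wide "apparent" vector add executed by a narrow ALU in multiple passes:

```python
# Illustrative sketch: an "apparent" wide vector add executed by an ALU
# that handles only a few elements per pass. Software sees one vector
# instruction; the hardware loops internally over the register.
def streamed_vector_add(a, b, alu_lanes=1):
    """Add two 'vector registers' (lists of 64-bit elements) using an
    ALU that processes `alu_lanes` elements per pass."""
    assert len(a) == len(b)
    result = []
    for i in range(0, len(a), alu_lanes):
        # One ALU pass: process the next `alu_lanes` elements,
        # wrapping each sum to 64 bits like a hardware adder would.
        chunk = [(x + y) & 0xFFFFFFFFFFFFFFFF
                 for x, y in zip(a[i:i + alu_lanes], b[i:i + alu_lanes])]
        result.extend(chunk)
    return result

# 64 elements of 64 bits each (4096 apparent bits), one lane per pass:
va = list(range(64))
vb = [2] * 64
print(streamed_vector_add(va, vb)[:4])  # → [2, 3, 4, 5]
```

With one 64-bit lane the loop makes 64 passes over the register, which is essentially how a Cray-1 streamed a vector register through its single pipelined ALU.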
> They also explain clearly enough why extra matrix instructions are useful. The throughput
> of a computational program operating on large data structures on a modern CPU is normally
> limited by the memory throughput, usually by the memory load throughput.
>
> For many problems you can easily reach the memory throughput limit on any CPU, and
> there is nothing that can be done to improve the performance beyond that.
>
> However, the problems that can be solved using matrix-matrix operations are an exception, because for large
> matrices the ratio between arithmetic operations and memory loads can become arbitrarily large, so increasing
> the number of arithmetic operations that can be done per memory load can increase the performance.
>
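A quick back-of-the-envelope version of that ratio argument: multiplying two N×N matrices takes about 2·N³ arithmetic operations, but with ideal blocking only about 3·N² elements need to cross the memory interface (load A, load B, store C), so the best-case ratio grows linearly with N. The numbers below are this idealized model, not measurements:

```python
# Arithmetic intensity of N x N matrix multiplication under an
# idealized model: each matrix crosses the memory interface once.
def flops_per_element(n):
    flops = 2 * n**3      # n*n dot products of length n (one mul + one add each)
    traffic = 3 * n**2    # load A, load B, store C
    return flops / traffic

# The ratio is 2*n/3, i.e. it grows without bound as n grows:
for n in (24, 240, 2400):
    print(n, flops_per_element(n))  # → 16.0, 160.0, 1600.0
```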
> The only way forward to greatly increase the number of arithmetic operations
> per memory load is to add matrix instructions, a.k.a. tensor instructions.
>
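As I understand the SME announcement, its matrix instructions are built around outer-product-and-accumulate operations into a tile of accumulators. The underlying pattern can be sketched in plain scalar code (an illustration of the algorithm, not of the actual instructions):

```python
# Matrix multiply expressed as a sum of outer products, the pattern
# behind outer-product-accumulate matrix/tensor instructions: each
# step reads one column of A and one row of B (2*n elements) and
# performs n*n multiply-accumulates into the result tile.
def matmul_outer(a, b):
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for k in range(n):
        col = [a[i][k] for i in range(n)]   # k-th column of A
        row = b[k]                          # k-th row of B
        for i in range(n):                  # rank-1 update: C += col x row
            for j in range(n):
                c[i][j] += col[i] * row[j]
    return c

print(matmul_outer([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# → [[19.0, 22.0], [43.0, 50.0]]
```

Each rank-1 update loads 2·n elements but performs n² multiply-accumulates, which is exactly where the favorable arithmetic-operations-per-load ratio comes from.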
> The instructions added by Intel in Sapphire Rapids have very limited usefulness. Due to
> their low precision, they can be used only for machine learning. The ARM extension appears
> to be much more general-purpose, so it should also be useful for other applications.
>
> Unfortunately it is not clear how many years will pass until the
> introduction of ARM cores implementing this ISA extension.
>
Unfortunately, I pressed "Post" without rereading the message, and there is no way to edit it here.
There are a few typos, but the most important error is that I meant to say that the Cray-1 had an apparent vector register width of 4096 bits (= 64 * 64), not 1024 bits.