By: Michael S (already5chosen.delete@this.yahoo.com), July 26, 2021 2:53 am

Room: Moderated Discussions

Adrian (a.delete@this.acm.org) on July 25, 2021 9:16 pm wrote:

> dmcq (dmcq.delete@this.fano.co.uk) on July 25, 2021 5:36 pm wrote:

> > Introducing the Scalable Matrix Extension for the Armv9-A Architecture

> >

> > I noticed ARM have issued yet another of their planned future

> > architecture blogs, this one is called "Scalable

> > Matrix Extension". Same sort of idea as what Intel is implementing as far as I can see. Except there is

> > a strange aspect in that it requires a new mode for SVE called "Streaming mode" SVE. They talk about having

> > the new SME instructions and a significant subset of the existing SVE2 instructions. And they say that

> > one could have a longer length for the registers in streaming and non-streaming mode. As far as I can

> > make out in fact only very straightforward operations are included in streaming mode.

> >

> > I guess that instead of being 'RISC' instructions these would have a loop dealing with widths greater

> > than the hardware SVE register or memory or cache width. They've implemented something like this in

> > the Cortex-M Helium extension, but I'd have thought they could just rely on OoO for the larger application

> > processor. If it is so they can have larger tiles in their matrix multiplication, I'd have thought there

> > would be other tricks that could do the job without a new mode. However I can't see they would have

> > put in a new mode without it being very important to them. Am I missing something?

>

>

>

> You are probably right about the necessity of looping in certain cases.

>

> They explain clearly enough why a streaming SVE mode is needed, to be able to present to

> software an apparent vector register width that is larger than the width of the ALU's.

>

> This "streaming" mode is actually exactly like the traditional vector computers have operated. For

> example a Cray-1 had an apparent vector register width of 1024 bits, but the vector operations were

> computed by a 64-bit pipelined ALU, in multiple clock cycles, i.e. "looping", like you say.

>

>

> They also explain clearly enough why extra matrix instructions are useful. The throughput

> of a computational program operating on large data structures on a modern CPU is normally

> limited by the memory throughput, usually by the memory load throughput.

>

> For many problems you can reach easily the memory throughput on any CPU and

> there is nothing that can be done to improve the performance above that.

>

> However the problems that can be solved using matrix-matrix operations are an exception, because for large

> > matrices the ratio between arithmetic operations and memory loads can become arbitrarily large, so increasing

> the number of arithmetic operations that can be done per memory load can increase the performance.

>

> The only way forward to greatly increase the number of arithmetic operations

> per memory load is to add matrix instructions,

It depends on the definition of "greatly" and on the specific operation. For the simplest, but fortunately the most common, operation of matrix multiply, it is possible to achieve quite a good ratio with quite a small modification to an AVX512-like ISA.

All we need is the ability to broadcast one of the multiplicands of a SIMD FMA from an arbitrary lane of a vector register. AVX512, in its current form, can broadcast from memory, but not from a register, which is good, but insufficient.
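To make the desired operation concrete, here is a scalar model of the hypothetical lane-broadcast FMA (the function name and shape are my own illustration, not an existing intrinsic; AVX512 today offers only the memory-operand embedded-broadcast form, e.g. `{1to8}`):

```c
/* Hypothetical instruction, modeled in scalar C:
   acc[l] += a[l] * b[lane] for every lane l of a 512-bit (8 x double) vector.
   Current AVX512 can broadcast the multiplicand only from a memory operand
   (embedded {1to8} broadcast), not from a lane of another register. */
void fma_broadcast_lane(double acc[8], const double a[8],
                        const double b[8], int lane)
{
    for (int l = 0; l < 8; l++)
        acc[l] += a[l] * b[lane];
}
```

The point of the register form is that a whole SIMD row of B can be loaded once and then consumed lane by lane across 8 FMAs, instead of issuing 8 separate broadcast loads.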

So, what becomes possible with such a minor change to the ISA?

Let's assume double-precision matrix multiplication and an ISA with 32x512-bit software-visible vector registers, plus 2-4 times as many physical vector registers in the OoO back end.

In the inner loop we multiply 10 rows by 16 columns (=2 SIMD columns).

10x16 = 160 accumulators occupy 160/8 = 20 vector registers (8 doubles per 512-bit register).

Additionally, we need 10 VRs for row inputs and 1 VR for column inputs (reloaded with 16 different values over the course of a single iteration of the inner loop).

So, overall we need 20+10+1=31 software-visible VRs, which is less than 32.

Each of the 160 accumulators is updated 8 times per iteration of the loop, so there are 1280 FMAs per iteration = 160 SIMD FMAs.

On each iteration we load 10x8 = 80 row inputs + 8x16 = 128 column inputs. Overall 208 double-precision loads = 26 SIMD loads.

FMA-to-load ratio = 160/26 ≈ 6.15.
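The blocking above can be sketched as a plain scalar C micro-kernel (my own illustrative code). In the hypothetical ISA, each run of 8 consecutive `j` values would be one 512-bit register, the 160 accumulators would live in 20 such registers, and the innermost multiply would use the lane-broadcast FMA:

```c
/* 10x16 register-blocked DGEMM micro-kernel, scalar form.
   One call = one iteration of the inner loop described above:
   depth KB = 8, so 10*16*8 = 1280 FMAs against 80 + 128 = 208 loads. */
enum { MR = 10, NR = 16, KB = 8 };  /* rows, columns, depth per iteration */

void microkernel(const double *A, const double *B, double *C,
                 int lda, int ldb, int ldc)
{
    double acc[MR][NR] = {{0}};          /* 160 accumulators = 20 VRs */
    for (int k = 0; k < KB; k++) {
        for (int i = 0; i < MR; i++) {
            double a = A[i*lda + k];     /* 10x8 = 80 row loads */
            for (int j = 0; j < NR; j++) /* B: 8x16 = 128 column loads */
                acc[i][j] += a * B[k*ldb + j];
        }
    }
    for (int i = 0; i < MR; i++)         /* write-back outside the hot loop */
        for (int j = 0; j < NR; j++)
            C[i*ldc + j] += acc[i][j];
}
```

In the vectorized version each row of B would be loaded once per k as two SIMD registers and consumed via lane broadcasts, which is where the 26-SIMD-loads count comes from.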

Whether you call it "great" or just "large" is your call, but personally I don't expect that any CPU core still considered a "CPU" rather than an "accelerator" will ever have an FMA-to-load ratio above 4, let alone above 6. Right now, it seems, none of them exceeds 1.5.

> a.k.a. tensor instructions.

If by "tensor instructions" you mean "outer product" (I think that's the most common meaning nowadays) then I tend to disagree.

IMHO, many tasks can be accelerated quite nicely by "inner product" instructions. And intuitively, without being a chip architect, I would think that "inner product" is inherently more energy-efficient than "outer product", due to lower fan-out to the register files.
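The two instruction shapes being contrasted, modeled in scalar C (my own sketch, not any vendor's encoding). An outer-product step writes a whole tile of accumulators per instruction, while an inner-product step reduces into a single accumulator, hence the difference in register-file fan-out:

```c
/* Outer product (SME-style): one instruction updates an 8x8 tile,
   64 FMAs and 64 accumulator writes fanned out across the tile. */
void outer_product_step(double acc[8][8], const double a[8], const double b[8])
{
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            acc[i][j] += a[i] * b[j];
}

/* Inner product (dot-product style): 8 FMAs reduced into one result. */
double inner_product_step(double acc, const double a[8], const double b[8])
{
    for (int l = 0; l < 8; l++)
        acc += a[l] * b[l];
    return acc;
}
```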

But, as shown above, neither is needed for the specific task of [big] matrix multiply.

Now, small or medium matrix multiply is a completely different kettle of fish. BTW, it is much more important to me from a practical perspective.

>

> The instructions added by Intel in Sapphire Rapids have a very limited usefulness. Due to

> their low precision they can be used only for Machine Learning. The ARM extension appears

> to be much more general purpose, so it should be also useful for other applications.

>

> Unfortunately it is not clear how many years will pass until the

> introduction of ARM cores implementing this ISA extension.

>

In case of Apple, the most probable answer is "never".

In case of QC, it's still not clear if they are coming back to the business of developing their own cores. Although if they aren't, then why did they pay an astronomical sum for NUVIA?

In case of Arm Inc., 3 years sounds too optimistic. 5 years, maybe.

As usual, that leaves Fujitsu. Maybe they are already in the process of doing it, and the current announcement is just a posteriori documentation of their work?
