ARM Scalable Matrix Extension

By: Rayla (rayla.delete@this.example.com), July 26, 2021 5:08 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on July 26, 2021 2:53 am wrote:
> Adrian (a.delete@this.acm.org) on July 25, 2021 9:16 pm wrote:
> > dmcq (dmcq.delete@this.fano.co.uk) on July 25, 2021 5:36 pm wrote:
> > > Introducing the Scalable Matrix Extension for the Armv9-A Architecture
> > >
> > > I noticed ARM have issued yet another of their planned future
> > > architecture blogs, this one is called "Scalable
> > > Matrix Extension". Same sort of idea as what Intel is implementing as far as I can see. Except there is
> > > a strange aspect in that it requires a new mode for SVE called "Streaming mode" SVE. They talk about having
> > > the new SME instructions and a significant subset of the existing SVE2 instructions. And they say that
> > > one could have a longer register length in streaming mode than in non-streaming mode. As far as I can
> > > make out, in fact only very straightforward operations are included in streaming mode.
> > >
> > > I guess that instead of being 'RISC' instructions these would have a loop dealing with widths greater
> > > than the hardware SVE register or memory or cache width. They've implemented something like this in
> > > the Cortex-M Helium extension, but I'd have thought they could just rely on OoO for the larger application
> > > processors. If it is so they can have larger tiles in their matrix multiplication, I'd have thought there
> > > would be other tricks that could do the job without a new mode. However I can't see that they would have
> > > put in a new mode without it being very important to them. Am I missing something?
> >
> >
> >
> > You are probably right about the necessity of looping in certain cases.
> >
> > They explain clearly enough why a streaming SVE mode is needed: to be able to present to
> > software an apparent vector register width that is larger than the width of the ALUs.
> >
> > This "streaming" mode is actually exactly like the traditional vector computers have operated. For
> > example a Cray-1 had an apparent vector register width of 1024 bits, but the vector operations were
> > computed by a 64-bit pipelined ALU, in multiple clock cycles, i.e. "looping", like you say.
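
As a rough illustration of that "looping" (a toy model of mine, not anything taken from the SME spec; the constants just echo the Cray-1 numbers above), the hardware walks an architecturally wide register in narrow beats:

/* Toy model of "streaming" execution: the ISA-visible vector register
   holds VL elements, but the datapath produces only LANES results per
   beat, so one architectural vector add runs as VL/LANES internal passes. */
#define VL    64   /* Cray-1 style: 64 elements of 64 bits per vector register */
#define LANES 1    /* one pipelined 64-bit ALU, one result per clock */

void vadd_streaming(double *dst, const double *a, const double *b)
{
    for (int beat = 0; beat < VL / LANES; beat++)     /* hardware "looping" */
        for (int lane = 0; lane < LANES; lane++) {
            int e = beat * LANES + lane;
            dst[e] = a[e] + b[e];                     /* one ALU result */
        }
}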
> >
> >
> > They also explain clearly enough why extra matrix instructions are useful. The throughput
> > of a computational program operating on large data structures on a modern CPU is normally
> > limited by the memory throughput, usually by the memory load throughput.
> >
> > For many problems you can easily reach the memory throughput limit on any CPU, and
> > there is nothing that can be done to improve the performance beyond that.
> >
> > However the problems that can be solved using matrix-matrix operations are an exception, because for large
> > matrices the ratio between arithmetic operations and memory loads can become arbitrarily large, so increasing
> > the number of arithmetic operations that can be done per memory load can increase the performance.
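
To put rough numbers on "arbitrarily large" (the figures here are mine, not Adrian's): for the product of two N x N matrices,

    FMAs ≈ N^3,    distinct elements touched ≈ 3·N^2,

so the best achievable ratio of arithmetic to loads grows roughly like N, limited in practice only by how much of that reuse the registers and caches can actually capture.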
> >
> > The only way forward to greatly increase the number of arithmetic operations
> > per memory load is to add matrix instructions,
>
> It depends on the definition of "greatly" and on the specific operation. For the simplest, but fortunately
> the most common, operation of matrix multiply, it is possible to achieve quite a good ratio
> with quite a small modification to an AVX512-like ISA.
> All we need is the ability to broadcast one of the multiplicands of a SIMD FMA from
> an arbitrary lane of a vector register. AVX512, in its current form, can do it
> from memory, but not from a register, which is good, but insufficient.
> So, what becomes possible with such a minor change to the ISA?
> Let's assume double-precision matrix multiplication, an ISA with 32x512-bit software-visible
> vector registers, and 2-4 times as many physical vector registers in the OoO back end.
> In the inner loop we multiply 10 rows by 16 columns (=2 SIMD columns).
> 10x16=160 accumulators occupy 160/8 = 20 vector registers.
> Additionally, we need 10 VRs for row inputs and 1 VR for the column input (loaded
> with 16 different values throughout a single iteration of the inner loop).
> So, overall we need 20+10+1=31 software-visible VRs, which is less than 32.
> Each of the 160 accumulators is updated 8 times per iteration of
> the loop, so there are 1280 FMAs per iteration == 160 SIMD FMAs.
> On each iteration we load 10x8=80 row inputs + 8*16=128 column
> inputs. Overall 208 double precision loads = 26 SIMD loads.
> FMA-to-load ratio = 160/26=6.15.
> Whether you call it "great" or just "large" is your call, but personally I don't expect that
> any CPU core that is still considered a "CPU" rather than an "accelerator" would ever have an FMA-to-load
> ratio above 4, let alone above 6. Right now, it seems, none of them exceeds 1.5.
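
For concreteness, here is a plain-C model of that blocking (my sketch, not Michael S's code; the function name and loop structure are mine). The scalar multiply in the inner loop is exactly where the proposed broadcast-from-lane FMA would sit: each A element is a lane of one of the 10 resident row registers, each row of acc[][] is 2 of the 20 accumulator registers, and B cycles through the single remaining column register.

/* Plain-C model of the 10x16 register-blocked kernel described above.
   A is 10 x K (leading dimension lda), B is K x 16 (leading dimension ldb),
   C is 10 x 16 (leading dimension ldc); K is assumed to be a multiple of 8. */
void kernel_10x16(int K, const double *A, int lda,
                         const double *B, int ldb,
                         double       *C, int ldc)
{
    double acc[10][16] = {{0}};            /* 160 accumulators = 20 512-bit VRs */
    for (int k = 0; k < K; k += 8) {       /* one iteration of the inner loop   */
        /* 10 SIMD loads: A[i][k..k+7] stays resident in 10 row registers       */
        /* 16 SIMD loads: B[k+kk][j..j+7] cycles through 1 column register      */
        for (int kk = 0; kk < 8; kk++)
            for (int i = 0; i < 10; i++) {
                double a = A[i*lda + k + kk];   /* broadcast from lane kk       */
                for (int j = 0; j < 16; j++)    /* 2 SIMD FMAs per (kk, i) pair */
                    acc[i][j] += a * B[(k + kk)*ldb + j];
            }
    }                                      /* 160 SIMD FMAs, 26 SIMD loads/iter */
    for (int i = 0; i < 10; i++)
        for (int j = 0; j < 16; j++)
            C[i*ldc + j] += acc[i][j];
}

With current AVX-512 each of those broadcasts would take a separate shuffle instruction and a scratch register; folding the lane broadcast into the FMA, as proposed above, removes that overhead while still fitting in 31 architectural registers.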
>
>
> > a.k.a. tensor instructions.
>
> If by "tensor instructions" you mean "outer product" (I think, nowadays
> that's the most common meaning) than I tend to disagree.
> IMHO, many tasks can be accelerated quite nicely by "inner product" instructions. And intuitively,
> without being a chip architect, I would think that "inner product" is inherently more energy
> efficient than "outer product", due to lower fan out to register files.
>
> But, as shown above, neither is needed for the specific task of [big] matrix multiply.
> Now, small or medium matrix multiply is a completely different kettle of
> fish. BTW, it's much more important for me, from a practical perspective.
>
> >
> > The instructions added by Intel in Sapphire Rapids have very limited usefulness. Due to
> > their low precision they can be used only for machine learning. The ARM extension appears
> > to be much more general-purpose, so it should also be useful for other applications.
> >
> > Unfortunately it is not clear how many years will pass until the
> > introduction of ARM cores implementing this ISA extension.
> >
>
> In the case of Apple, the most probable answer is "never".
>
> In the case of QC, it's still not clear if they are coming back to the business of developing their
> own cores. Although if they don't, then why did they pay an astronomical sum for NUVIA?
>
> In the case of Arm Inc., 3 years sounds too optimistic. 5 years, maybe.
>
> As usual, that leaves Fujitsu. Maybe they are already in the process of doing it,
> and the current announcement is just a posteriori documentation of their work?
>

Qualcomm has stated outright that they're going to ship cores developed by Nuvia. I'm not sure why it's "unclear."

https://www.anandtech.com/show/16553/qualcomm-completes-acquisition-of-nuvia
https://www.reuters.com/technology/qualcomms-new-ceo-eyes-dominance-laptop-markets-2021-07-01/