ARM SVE Streaming Mode

By: dmcq (dmcq.delete@this.fano.co.uk), July 26, 2021 4:18 am
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on July 26, 2021 12:21 am wrote:
> Adrian (a.delete@this.acm.org) on July 25, 2021 9:16 pm wrote:
> > dmcq (dmcq.delete@this.fano.co.uk) on July 25, 2021 5:36 pm wrote:
> > > Introducing the Scalable Matrix Extension for the Armv9-A Architecture
> > >
> > > I noticed ARM have issued yet another of their planned future
> > > architecture blogs, this one is called "Scalable
> > > Matrix Extension". Same sort of idea as what Intel is implementing as far as I can see. Except there is
> > > a strange aspect in that it requires a new mode for SVE called "Streaming mode" SVE. They talk about having
> > > the new SME instructions and a significant subset of the existing SVE2 instructions. And they say that
> > > one could have a longer length for the registers in streaming and non-streaming mode. As far as I can
> > > make out in fact only very straightforward operations are included in streaming mode.
> > >
> > > I guess that instead of being 'RISC' instructions these would have a loop dealing with widths greater
> > > than the hardware SVE register or memory or cache width. They've implemented something like this in
> > > the Cortex-M Helium extension, but I'd have thought they could just rely on OoO for the larger application
> > > processor. If it is so they can have larger tiles in their matrix multiplication I'd have thought there
> > > would be other tricks that could do the job without a new mode. However I can't see they would have
> > > put in a new mode without it being very important to them. Am I missing something?
> >
> >
> >
> > You are probably right about the necessity of looping in certain cases.
> >
> > They explain clearly enough why a streaming SVE mode is needed, to be able to present to
> > software an apparent vector register width that is larger than the width of the ALU's.
> >
> > This "streaming" mode is actually exactly like the traditional vector computers have operated. For
> > example a Cray-1 had an apparent vector register width of 1024 bits, but the vector operations were
> > computed by a 64-bit pipelined ALU, in multiple clock cycles, i.e. "looping", like you say.
> >
> >
>
>
> So the SVE Streaming Mode, by switching the apparent vector register length, will allow the choice
> between 2 sets of vector instructions, one with low-latency instructions processing few data per
> instruction and one with high-latency instructions processing many data per instruction.
>
> The high-latency instructions available in the "SVE Streaming Mode" will make
> it easier to achieve the maximum throughput, by requiring fewer instructions
> to do the work, but obviously they are not suitable for every task.
>
> The ability to change modes to get the desired compromise between latency and throughput is
> certainly very valuable, as long as switching modes will not require an excessive time.
>
> I assume that the ABI will specify that the non-streaming SVE mode is default, so any procedure needing
> the streaming mode will have to switch modes upon function entry and function exit, so it will have to
> gain enough from the high-latency complex instructions to recover the time lost with mode switching.

It does sound like they want to do that. However, getting a gain from high-latency instructions isn't quite as easy as just having longer vectors. If their streaming operations have to complete before dependent operations can start, and each load or store takes four cache operations, then for

load v1
load v2
add v3,v1,v2
store v3

there would be no load or store in progress while the add is performed, whereas if a loop does these operations on shorter vectors with OoO, there would be no wasted load/store time. And the streaming case only gets worse if you do more arithmetic.
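A rough way to put numbers on this is the toy cycle-count model below. It is only a sketch under assumptions that are mine, not anything ARM has stated: one load/store port and one ALU, each handling one hardware-width chunk per cycle, a streaming vector made of four such chunks, and pipeline fill and drain ignored. The function names are made up for the illustration.

```python
# Toy cycle-count model. Every number and name here is an assumption
# made for illustration, not a description of any real ARM design.

def serialized_cycles(chunks: int, arith_ops: int) -> int:
    """Long-vector case: each instruction finishes before a dependent one starts.

    The two loads and the store each occupy the load/store port for
    `chunks` cycles, and each arithmetic op occupies the ALU for
    `chunks` cycles while the load/store port sits idle.
    """
    memory_ops = 3  # load v1, load v2, store v3
    return chunks * (memory_ops + arith_ops)


def overlapped_cycles(chunks: int, arith_ops: int) -> int:
    """Steady-state estimate for a loop of chunk-sized operations with OoO.

    The load/store port and the ALU work on different iterations at the
    same time, so the busier of the two units sets the pace.
    """
    memory_ops = 3
    return chunks * max(memory_ops, arith_ops)


if __name__ == "__main__":
    for arith_ops in (1, 2, 4):
        s = serialized_cycles(4, arith_ops)
        o = overlapped_cycles(4, arith_ops)
        print(f"arith ops={arith_ops}: serialized={s}, overlapped={o}")
```

With one add the serialized sequence takes 16 cycles against 12 for the overlapped loop, and the gap widens to 20 against 12 with two arithmetic operations, which is the "only gets worse" effect in this simplified model.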

I think they'd have to implement something like their Helium ("Enhancing the Capabilities of the Smallest Devices") to avoid that sort of problem and get an overall gain. Now I think their implementation of that is amazing, and it goes into Cortex-M, but it sounds to me like mixing it up with OoO would definitely make control more complex! Complexificationization I'd call it. :-) They must be pretty certain of their design and verification tools!