By: dmcq (dmcq.delete@this.fano.co.uk), July 27, 2021 2:39 am
Room: Moderated Discussions
--- (---.delete@this.redheron.com) on July 26, 2021 5:48 pm wrote:
> Brett (ggtgp.delete@this.yahoo.com) on July 26, 2021 1:54 pm wrote:
> > Doug S (foo.delete@this.bar.bar) on July 26, 2021 11:38 am wrote:
> > > dmcq (dmcq.delete@this.fano.co.uk) on July 26, 2021 5:38 am wrote:
> > > > Michael S (already5chosen.delete@this.yahoo.com) on July 26, 2021 2:53 am wrote:
> > > > > Adrian (a.delete@this.acm.org) on July 25, 2021 9:16 pm wrote:
> > > > > > dmcq (dmcq.delete@this.fano.co.uk) on July 25, 2021 5:36 pm wrote:
> > > > > > > Introducing the Scalable Matrix Extension for the Armv9-A Architecture
> > > > > > >
> > > > > > > I noticed ARM have issued yet another of their planned future
> > > > > > > architecture blogs; this one is called "Scalable
> > > > > > > Matrix Extension". Same sort of idea as what Intel is implementing, as far as I can see. Except there is
> > > > > > > a strange aspect in that it requires a new mode for SVE called "Streaming mode" SVE. They talk about having
> > > > > > > the new SME instructions and a significant subset of the existing SVE2 instructions. And they say that
> > > > > > > the registers could have a longer length in streaming mode than in non-streaming mode. As far as I can
> > > > > > > make out, in fact only very straightforward operations are included in streaming mode.
> > > > > > >
> > > > > > > I guess that instead of being 'RISC' instructions these would have a loop dealing with widths greater
> > > > > > > than the hardware SVE register or memory or cache width. They've implemented something like this in
> > > > > > > the Cortex-M Helium extension, but I'd have thought they could just rely on OoO for the larger application
> > > > > > > processors. If it is so they can have larger tiles in their matrix multiplication, I'd have thought there
> > > > > > > would be other tricks that could do the job without a new mode. However I can't see them putting in
> > > > > > > a new mode without it being very important to them. Am I missing something?
> > > > > >
> > > > > > You are probably right about the necessity of looping in certain cases.
> > > > > >
> > > > > > They explain clearly enough why a streaming SVE mode is needed: to be able to present to
> > > > > > software an apparent vector register width that is larger than the width of the ALUs.
> > > > > >
> > > > > > This "streaming" mode is actually exactly like the traditional vector computers have operated. For
> > > > > > example a Cray-1 had an apparent vector register width of 1024 bits, but the vector operations were
> > > > > > computed by a 64-bit pipelined ALU, in multiple clock cycles, i.e. "looping", like you say.
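As a minimal conceptual sketch of that "looping" (not real SVE/SME or Cray code; the 2-lane physical width and the function name are assumptions for illustration):

```c
/* One "streaming" vector FMA on an apparent vl-element register,
 * executed by a narrower physical datapath in vl/LANES beats --
 * the hardware loops internally, the software sees one instruction. */
void streaming_vfma(double *acc, const double *a, const double *b, int vl)
{
    const int LANES = 2;                     /* assumed 128-bit FMA pipe */
    for (int beat = 0; beat < vl; beat += LANES)
        for (int l = 0; l < LANES && beat + l < vl; l++)
            acc[beat + l] += a[beat + l] * b[beat + l];
}
```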
> > > > > >
> > > > > > They also explain clearly enough why extra matrix instructions are useful. The throughput
> > > > > > of a computational program operating on large data structures on a modern CPU is normally
> > > > > > limited by the memory throughput, usually by the memory load throughput.
> > > > > >
> > > > > > For many problems you can easily reach the memory throughput limit on any CPU and
> > > > > > there is nothing that can be done to improve the performance above that.
> > > > > >
> > > > > > However the problems that can be solved using matrix-matrix operations are an exception, because for large
> > > > > > matrices the ratio between arithmetic operations and memory loads can become arbitrarily large, so increasing
> > > > > > the number of arithmetic operations that can be done per memory load can increase the performance.
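For concreteness (a standard back-of-the-envelope figure, not from the post): multiplying two NxN matrices takes about 2*N^3 arithmetic operations but, with good blocking, only about 3*N^2 elements need to move between memory and the core (read A and B, write C), so the operations-per-load ratio grows roughly as 2N/3, i.e. without bound as N grows.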
> > > > > >
> > > > > > The only way forward to greatly increase the number of arithmetic operations
> > > > > > per memory load is to add matrix instructions,
> > > > >
> > > > > It depends on the definition of "greatly" and on the specific operation. For the simplest, but fortunately
> > > > > the most common, operation of matrix multiply, it is possible to achieve quite a good ratio
> > > > > with quite a small modification to an AVX512-like ISA.
> > > > > All we need is the ability to broadcast one of the multiplicands of a SIMD FMA from
> > > > > an arbitrary lane of a vector register. AVX512, in its current form, can do it
> > > > > from memory, but not from a register, which is good, but insufficient.
> > > > > So, what becomes possible with such a minor change in the ISA?
> > > > > Let's assume double-precision matrix multiplication, an ISA with 32x512-bit software-visible
> > > > > vector registers, and 2-4 times as many physical vector registers in the OoO back end.
> > > > > In the inner loop we multiply 10 rows by 16 columns (= 2 SIMD columns).
> > > > > 10x16=160 accumulators occupy 160/8 = 20 vector registers.
> > > > > Additionally, we need 10 VRs for row inputs and 1 VR for the column input (loaded
> > > > > with 16 different values throughout a single iteration of the inner loop).
> > > > > So, overall we need 20+10+1=31 software-visible VRs, which is less than 32.
> > > > > Each of the 160 accumulators is updated 8 times per iteration of
> > > > > the loop, so there are 1280 FMAs per iteration == 160 SIMD FMAs.
> > > > > On each iteration we load 10x8=80 row inputs + 8*16=128 column
> > > > > inputs. Overall 208 double precision loads = 26 SIMD loads.
> > > > > FMA-to-load ratio = 160/26 ≈ 6.15.
> > > > > Whether you call it "great" or just "large" is your call, but personally I don't expect that
> > > > > any CPU core that is still considered a "CPU" rather than an "accelerator" would ever have an FMA-to-load
> > > > > ratio above 4, let alone above 6. Right now, it seems, none of them exceeds 1.5.
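To make that blocking concrete, here is a minimal sketch with AVX-512 intrinsics. Since AVX-512 has no FMA that broadcasts from an arbitrary register lane, the sketch emulates it with a permute plus an ordinary FMA; the ISA change proposed above would fuse those two into one instruction. The function name, row-major layout, and the assumption that K is a multiple of 8 are illustrative assumptions, not from the post:

```c
#include <immintrin.h>

/* 10x16 register-blocked DGEMM micro-kernel, C += A*B.
 * A is 10 x K, B is K x 16, C is 10 x 16, all row-major with leading
 * dimensions lda/ldb/ldc; K is assumed to be a multiple of 8. */
void dgemm_10x16(const double *A, const double *B, double *C,
                 long K, long lda, long ldb, long ldc)
{
    __m512d acc[10][2];                              /* 20 accumulator VRs */
    for (int r = 0; r < 10; r++)
        for (int c = 0; c < 2; c++)
            acc[r][c] = _mm512_loadu_pd(&C[r * ldc + 8 * c]);

    for (long k = 0; k < K; k += 8) {                /* one "iteration" above */
        __m512d arow[10];                            /* 10 row-input VRs */
        for (int r = 0; r < 10; r++)                 /* 10 SIMD loads = 80 values */
            arow[r] = _mm512_loadu_pd(&A[r * lda + k]);

        for (int kk = 0; kk < 8; kk++) {
            for (int c = 0; c < 2; c++) {            /* 16 SIMD loads = 128 values */
                __m512d bvec = _mm512_loadu_pd(&B[(k + kk) * ldb + 8 * c]);
                for (int r = 0; r < 10; r++) {       /* 160 SIMD FMAs per iteration */
                    /* Emulated "FMA with broadcast from register lane kk";
                     * the proposed instruction would do this in one op. */
                    __m512d a_bc = _mm512_permutexvar_pd(_mm512_set1_epi64(kk),
                                                         arow[r]);
                    acc[r][c] = _mm512_fmadd_pd(a_bc, bvec, acc[r][c]);
                }
            }
        }
    }
    for (int r = 0; r < 10; r++)
        for (int c = 0; c < 2; c++)
            _mm512_storeu_pd(&C[r * ldc + 8 * c], acc[r][c]);
}
```

Per 8-deep k-step this is exactly 26 SIMD loads against 160 SIMD FMAs, i.e. the 6.15 ratio above, using 20 + 10 + 1 = 31 architectural vector registers.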
> > > > >
> > > > > > a.k.a. tensor instructions.
> > > > >
> > > > > If by "tensor instructions" you mean "outer product" (I think, nowadays
> > > > > that's the most common meaning) than I tend to disagree.
> > > > > IMHO, many tasks can be accelerated quite nicely by "inner product" instructions. And intuitively,
> > > > > without being a chip architect, I would think that "inner product" is inherently more energy
> > > > > efficient than "outer product", due to lower fan out to register files.
> > > > >
> > > > > But, as shown above, neither is needed for the specific task of [big] matrix multiply.
> > > > > Now, small or medium matrix multiply is a completely different kettle of
> > > > > fish. BTW, it is much more important for me, from a practical perspective.
> > > > >
> > > > > > The instructions added by Intel in Sapphire Rapids have very limited usefulness. Due to
> > > > > > their low precision they can be used only for Machine Learning. The ARM extension appears
> > > > > > to be much more general purpose, so it should also be useful for other applications.
> > > > > >
> > > > > > Unfortunately it is not clear how many years will pass until the
> > > > > > introduction of ARM cores implementing this ISA extension.
> > > > >
> > > > > In case of Apple, the most probable answer is "never".
> > > > >
> > > > > In the case of QC, it's still not clear if they are coming back to the business of developing their
> > > > > own cores. Although if they aren't, then why did they pay an astronomical sum for NUVIA?
> > > > >
> > > > > In the case of Arm Inc., 3 years sounds too optimistic. 5 years, maybe.
> > > > >
> > > > > As usual, that leaves Fujitsu. Maybe they are already in the process of doing it
> > > > > and the current announcement is just a posteriori documentation of their work?
> > > >
> > > > Apple already have a facility to do matrix multiplication operations. I don't know how it compares, but
> > > > it would be peculiar not to support these operations eventually, especially in their desktop computers.
> > > > And they would be involved in any discussions ARM have about future enhancements, so they probably knew
> > > > about this quite a while ago, judging from how long ARM was discussing SVE before it described its plans.
> > >
> > > Apple already added their own matrix instructions (AMX) with the A13 almost two years ago. They seem
> > > to be strongly encouraging developers to use library calls to access them, rather than writing
> > > their own code using them. The most likely reasons for doing that would be if they want to be able
> > > to make incompatible changes to AMX, or to be able to completely replace AMX down the road.
> >
> > Little cores are not going to have AMX; it is better to make a call that does
> > lots of AMX operations remotely than to fault on each AMX operation.
> >
> > > We can only guess whether Apple intends to continue going their
> > > own way with AMX, or whether it was intended as a placeholder
> > > until SME was ready. While Apple would have input into SME's design as one of ARM's biggest partners, they
> > > may prefer their own matrix instructions, as those can be precisely
> > > targeted at Apple customers' needs. ARM needs
> > > to consider the whole market from mobile to HPC; Apple's product line isn't quite so broad.
> >
> >
>
> Dougall says Ice has AMX
> https://twitter.com/dougallj/status/1373973478731255812
>
> Apple AMX operates significantly differently from NEON in that it's essentially an asynchronous accelerator
> that just happens to use the host CPU for instruction fetch, scheduling, and similar grunt work. It could
> look like any external accelerator, except that by being placed so tightly inside the CPU you don't have to waste
> overhead on DMA (just use standard register load instructions), and synchronization is somewhat easier.
>
> It's conceivable that what ARM means by SVE/SME "streaming" mode is a similar degree of asynchronicity
> -- get most of the convenience of having the instruction stream look like it's part of the standard
> instruction stream, but for some purposes (including, but not limited to, what you can expect regarding
> [lack of] out-of-order behavior, or synchronization of this engine with the rest of the core) you
> need to view streaming mode as an asynchronous, loosely coupled accelerator.
> Hwacha would be something similar, I think.
>
I think you're right, okay. They say there could be a considerable delay in any subsequent operation that uses flags or general registers whose value was produced by a streaming SVE operation.