By: Adrian (a.delete@this.acm.org), July 26, 2021 3:41 am

Room: Moderated Discussions

Michael S (already5chosen.delete@this.yahoo.com) on July 26, 2021 2:53 am wrote:

> >

> > The only way forward to greatly increase the number of arithmetic operations

> > per memory load is to add matrix instructions,

>

> It depends on the definition of "greatly" and on specific operation. For the simplest, but fortunately

> the most common, operation of matrix multiply, it is possible to achieve quite good ratio

> with quite small modification to AVX512-like ISA.

> All we need is the ability to broadcast one of the multiplicands of a SIMD FMA from

> an arbitrary lane of a vector register. AVX512, in its current form, can do it

> from memory, but not from a register, which is good, but insufficient.

> So, what becomes possible with such a minor change in the ISA?

> Let's assume double-precision matrix multiplication, ISA with 32x512-bit software-visible

> vector registers and 2-4 times as many physical vector registers in OoO back end.

> In the inner loop we multiply 10 rows by 16 columns (=2 SIMD columns).

> 10x16=160 accumulators occupy 160/8 = 20 vector registers.

> Additionally, we need 10 VRs for row inputs and 1 VR for column input (loaded

> with 16 different values throughout single iteration of the inner loop).

> So, overall we need 20+10+1=31 software-visible VRs, which is less than 32.

> Each of 160 accumulators is updated 8 times per one iteration of

> the loop, so there are 1280 FMAs per iteration == 160 SIMD FMAs.

> On each iteration we load 10x8=80 row inputs + 8*16=128 column

> inputs. Overall 208 double precision loads = 26 SIMD loads.

> FMA-to-load ratio = 160/26=6.15.

> Whether you call it "great" or just "large" is your call, but personally I don't expect that

> any CPU core that is still considered "CPU" rather than "accelerator" would ever have FMA-to-load

> ratio above 4, let alone, above 6. Right now, it seems, none of them exceeds 1.5.
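The register and load counting above can be checked with a short script (a plain-Python sketch only; the 8-lane 512-bit vectors and the 10x16 tile are taken directly from the quoted reasoning):

```python
VL = 8                 # double-precision lanes per 512-bit vector
ROWS, COLS = 10, 16    # inner-loop tile: 10 rows x 16 columns (2 SIMD columns)
K = VL                 # depth advanced per iteration of the inner loop

# Register budget: 160 scalar accumulators packed 8 per vector register.
acc_vrs = ROWS * COLS // VL              # 20 accumulator VRs
row_vrs, col_vrs = ROWS, 1               # 10 row-input VRs + 1 column-input VR
total_vrs = acc_vrs + row_vrs + col_vrs  # 31, fits in 32 software-visible VRs

# Work per iteration: each accumulator VR is updated K times.
simd_fmas = acc_vrs * K                  # 160 SIMD FMAs (= 1280 scalar FMAs)

# Traffic per iteration: 10x8 row inputs + 8x16 column inputs.
scalar_loads = ROWS * VL + K * COLS      # 80 + 128 = 208 doubles
simd_loads = scalar_loads // VL          # 26 SIMD loads

ratio = simd_fmas / simd_loads           # about 6.15 FMAs per load
print(total_vrs, simd_fmas, simd_loads, round(ratio, 2))
```

This reproduces the 20+10+1=31 register budget and the 160/26 ≈ 6.15 FMA-to-load ratio claimed above.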

>

>

> > a.k.a. tensor instructions.

>

> If by "tensor instructions" you mean "outer product" (I think, nowadays

> that's the most common meaning) then I tend to disagree.

> IMHO, many tasks can be accelerated quite nicely by "inner product" instructions. And intuitively,

> without being a chip architect, I would think that "inner product" is inherently more energy

> efficient than "outer product", due to lower fan out to register files.

>

> But, as shown above, neither is needed for specific task of [big] matrix multiply.

> Now, small or medium matrix multiply is a completely different kettle of

> fish. BTW, that case is much more important for me from a practical perspective.

>

> >

I agree that what you propose might be a simpler way to increase the performance of matrix operations.

Nevertheless, looking at the SME instructions at

https://developer.arm.com/documentation/ddi0602/2021-06/SME-Instructions?lang=en

I see that most of them implement various kinds of outer products, like the NVIDIA and Intel ISA extensions.

I assume that when they selected the instructions to include in this ISA extension, they estimated the implementation cost, so the extra area and energy consumption should be reasonable, even if it is not certain that this is the most efficient way to increase the throughput of matrix operations.
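For concreteness, the outer-product style that SME's FMOPA-like instructions favour can be modelled in a few lines of plain Python (a sketch of the semantics only; the function names are illustrative, not actual instructions): matrix multiply becomes a sum of rank-1 updates into an accumulator tile.

```python
def fmopa(za, zn, zm):
    """One outer-product-accumulate step, modelling the semantics of an
    SME-style FMOPA: za[i][j] += zn[i] * zm[j] for every pair of lanes."""
    for i, a in enumerate(zn):
        for j, b in enumerate(zm):
            za[i][j] += a * b
    return za

def matmul_via_outer_products(A, B):
    """C = A x B built as K rank-1 updates: one column of A times one row
    of B per step -- the scheme an outer-product tile accumulator favours."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]       # the accumulator tile
    for p in range(k):
        col = [A[i][p] for i in range(m)]   # p-th column of A
        row = B[p]                          # p-th row of B
        fmopa(C, col, row)
    return C
```

For example, `matmul_via_outer_products([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]])` returns `[[19.0, 22.0], [43.0, 50.0]]`. Note the fan-out the quoted post worries about: each step writes the entire m x n tile from only m + n inputs.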


Topic | Posted By | Date
---|---|---
ARM Scalable Matrix Extension | dmcq | 2021/07/25 05:36 PM
ARM Scalable Matrix Extension | Adrian | 2021/07/25 09:16 PM
Sorry, typos | Adrian | 2021/07/25 10:32 PM
ARM SVE Streaming Mode | Adrian | 2021/07/26 12:21 AM
ARM SVE Streaming Mode | dmcq | 2021/07/26 04:18 AM
ARM SVE Streaming Mode | Adrian | 2021/07/26 04:45 AM
ARM Scalable Matrix Extension | Michael S | 2021/07/26 02:53 AM
ARM Scalable Matrix Extension | Adrian | 2021/07/26 03:41 AM
Inner & outer product | Adrian | 2021/07/26 03:52 AM
ARM Scalable Matrix Extension | Rayla | 2021/07/26 05:08 AM
ARM Scalable Matrix Extension | dmcq | 2021/07/26 05:38 AM
ARM Scalable Matrix Extension | Doug S | 2021/07/26 11:38 AM
ARM Scalable Matrix Extension | Brett | 2021/07/26 01:54 PM
ARM Scalable Matrix Extension | --- | 2021/07/26 05:48 PM
ARM Scalable Matrix Extension | dmcq | 2021/07/27 02:39 AM
ARM Scalable Matrix Extension | Anon | 2021/07/26 06:08 AM
ARM Scalable Matrix Extension | lkcl | 2022/07/28 03:38 PM
ARM Scalable Matrix Extension | dmcq | 2022/07/29 02:24 PM
ARM Scalable Matrix Extension | lkcl | 2022/07/29 03:44 PM