By: Adrian (a.delete@this.acm.org), July 26, 2021 3:52 am
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on July 26, 2021 3:41 am wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on July 26, 2021 2:53 am wrote:
> >
> > > a.k.a. tensor instructions.
> >
> > If by "tensor instructions" you mean "outer product" (I think that is
> > the most common meaning nowadays), then I tend to disagree.
> > IMHO, many tasks can be accelerated quite nicely by "inner product" instructions. And intuitively,
> > without being a chip architect, I would think that "inner product" is inherently more energy
> > efficient than "outer product", due to lower fan-out to the register files.
> >
> > But, as shown above, neither is needed for the specific task of [big] matrix multiply.
> > Small or medium matrix multiply is a completely different kettle of
> > fish. BTW, it is much more important for me from a practical perspective.
> >
You are right that the inner product is more energy-efficient, so when speed is less important, matrix operations should be decomposed into inner products where possible.
However, the latency of an inner product grows with the vector length, because the reduction into a single accumulator forms a serial dependency chain. For maximum speed it is therefore frequently better to use lower-latency operations, e.g. outer products or AXPY, even if they require more storage for the intermediate values.
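To make the latency argument concrete, here is a minimal C sketch (mine, not from the post) contrasting the two loop orderings for C = A*B. The function names and the row-major NxN layout are assumptions chosen for illustration, not anything defined in this thread.

```c
#include <stddef.h>

/* Inner-product form: each C[i][j] is a dot product. The accumulation
 * into "acc" is a serial chain of n dependent multiply-adds, so its
 * latency grows with the vector length. */
void matmul_inner(size_t n, const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; k++)
                acc += A[i*n + k] * B[k*n + j];   /* serial dependency chain */
            C[i*n + j] = acc;
        }
}

/* Outer-product / AXPY form: for each k, add the outer product of
 * column k of A and row k of B into C. The n*n partial sums in C are
 * independent of one another, so each step is a short-latency AXPY
 * (row of C += a * row of B), at the cost of keeping the whole C tile
 * live as intermediate storage. */
void matmul_outer(size_t n, const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < n*n; i++)
        C[i] = 0.0f;
    for (size_t k = 0; k < n; k++)
        for (size_t i = 0; i < n; i++) {
            float a = A[i*n + k];
            for (size_t j = 0; j < n; j++)
                C[i*n + j] += a * B[k*n + j];     /* independent accumulators */
        }
}
```

In the first form the accumulator carries a chain of n dependent operations; in the second, every element of C is its own accumulator, which is why the outer-product formulation maps naturally onto outer-product-accumulate instructions like those discussed in this thread, at the price of keeping an n-by-n tile of partial sums live.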