By: Adrian (a.delete@this.acm.org), July 26, 2021 4:45 am
Room: Moderated Discussions
dmcq (dmcq.delete@this.fano.co.uk) on July 26, 2021 4:18 am wrote:
> Adrian (a.delete@this.acm.org) on July 26, 2021 12:21 am wrote:
> >
> >
> > So the SVE Streaming Mode, by switching the apparent vector register length, will allow the choice
> > between 2 sets of vector instructions, one with low-latency instructions processing few data per
> > instruction and one with high-latency instructions processing many data per instruction.
> >
> > The high-latency instructions available in the "SVE Streaming Mode" will make
> > it easier to achieve the maximum throughput, by requiring fewer instructions
> > to do the work, but obviously they are not suitable for every task.
> >
> > The ability to change modes to get the desired compromise between latency and throughput is
> > certainly very valuable, as long as switching modes will not require an excessive time.
> >
> > I assume that the ABI will specify that the non-streaming SVE mode is default, so any procedure needing
> > the streaming mode will have to switch modes upon function entry and function exit, so it will have to
> > gain enough from the high-latency complex instructions to recover the time lost with mode switching.
>
> It does sound like they want to do that. However, high latency isn't quite as simple as
> just having longer vectors. If their streaming operations have to complete before
> dependent operations start, and take four cache operations per load or store, then for
>
> load v1
> load v2
> add v3, v1, v2
> store v3
>
> there would be no load or store while the add is performed, whereas if there
> is a loop doing these operations with OoO there would be no wasted load/store
> time. And the streaming only gets worse if you do more arithmetic.
>
> I think they'd have to implement something like their Helium
> Enhancing the Capabilities of the Smallest Devices
> to avoid that sort of problem and get an overall gain. Now I think their implementation
> of that is amazing and it goes into Cortex-M - but it sounds to me like mixing it up
> with OoO would definitely make control more complex! Complexificationization I'd call
> it. :-) They must be pretty certain of their design and verification tools!
Because the array ZA, which stores the results, appears to have 4 or 8 tiles (depending on the data type), and those tiles can be the destinations of outer products or memory loads, or the sources of memory stores, I assume that you have to overlap memory loads, memory stores and arithmetic operations that use different ZA tiles in order to reach the maximum throughput.

I suppose that when the SME manual is published, before the end of 2021, it will say whether operations targeting the same tile can partially overlap in time, or whether you have to wait until the previous operation has completed for the last tile element before doing anything with the first tile element.