By: lkcl (luke.leighton.delete@this.gmail.com), July 29, 2022 3:44 pm
Room: Moderated Discussions
dmcq (dmcq.delete@this.fano.co.uk) on July 29, 2022 2:24 pm wrote:
> lkcl (luke.leighton.delete@this.gmail.com) on July 28, 2022 3:38 pm wrote:
> As far as I can see they have two predicates controlling which elements
> are active in a tile so yes they should be able to do that sort of thing.
okaay, so the actual hardware size is power-2 boundaries and masks
can be arbitrary?
https://developer.arm.com/documentation/ddi0602/2022-06/SME-Instructions/SMOPA--Signed-integer-sum-of-outer-products-and-accumulate-?lang=en
<Pn> Is the name of the first governing scalable predicate register P0-P7,
encoded in the "Pn" field.
<Pm> Is the name of the second governing scalable predicate register P0-P7,
encoded in the "Pm" field.
so both the Pn and Pm predicate bits have to be true for the
multiply-accumulate to take place.
FMOPA seems to be the same:
https://developer.arm.com/documentation/ddi0602/2022-06/SME-Instructions/FMOPA--non-widening---Floating-point-outer-product-and-accumulate-?lang=en
if ElemP[mask1, row, esize] == '1' &&
      ElemP[mask2, col, esize] == '1' then
    Elem[result, row*dim+col, esize] = FPMulAdd_ZA(element3, element1, element2, FPCR[]);
else
    Elem[result, row*dim+col, esize] = element3;
hang on... that's still writing results out. which would mean rather
that the masks would be useful to detect zeros, in advance of performing
the outer-product.
(on the reasonable basis that doing result += 0.0*M or += N*0.0 is
wasting CPU cycles).
so i would expect a "pre-zero-detection-phase" to be run prior to
calling this instruction.
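to make the two-predicate behaviour concrete, here is a minimal python sketch of the quoted FMOPA loop. assumptions: plain lists stand in for the ZA tile and the Z/P registers, the multiply and add are done separately rather than fused as FPMulAdd_ZA would do, and fmopa_sketch plus its argument names are made up for illustration, not ARM's.

```python
# minimal sketch of the predicated outer-product-accumulate quoted above.
# plain python lists stand in for the ZA tile and the Z/P registers;
# fmopa_sketch and its arguments are hypothetical names, not ARM's.
# note: the multiply and add here are NOT fused, unlike FPMulAdd_ZA.

def fmopa_sketch(za, zn, zm, pn, pm):
    """return a new dim x dim tile: za[row][col] + zn[row] * zm[col]
    wherever both pn[row] and pm[col] are set, else za[row][col] as-is."""
    dim = len(zn)
    result = [list(r) for r in za]  # copy so the input tile is untouched
    for row in range(dim):
        for col in range(dim):
            if pn[row] and pm[col]:
                # active element: multiply-accumulate
                result[row][col] = za[row][col] + zn[row] * zm[col]
            # inactive element: keeps the old accumulator value (element3)
    return result


za = [[10.0] * 4 for _ in range(4)]
zn = [1.0, 2.0, 3.0, 4.0]
zm = [1.0, 1.0, 2.0, 2.0]
pn = [True, True, True, False]   # only 3 rows active
pm = [True, True, False, False]  # only 2 columns active
out = fmopa_sketch(za, zn, zm, pn, pm)
for r in out:
    print(r)
```

with these masks only a 3x2 corner of the 4x4 tile changes; every other element comes back with its old accumulator value.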
which still does not entirely determine whether VL can be non-power-of-two:
constant integer esize = 32; # 64 for double-precision
constant integer VL = CurrentVL;
constant integer PL = VL DIV 8;
constant integer dim = VL DIV esize;
for row = 0 to dim-1
    for col = 0 to dim-1
it all looks very odd, to me - row/col dimensions fixed at the same
size? if you click on the link CurrentVL it takes you here:
https://developer.arm.com/documentation/ddi0602/2022-06/Shared-Pseudocode/AArch64-Functions?lang=en#impl-aarch64.CurrentVL.read.none
integer CurrentVL
    return if HaveSME() && PSTATE.SM == '1' then SVL else NVL;
err... now you have to track both/either SVL and NVL...
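the CurrentVL selection above boils down to a two-way choice. a rough python sketch, where the four arguments are hypothetical stand-ins for HaveSME(), PSTATE.SM, SVL and NVL (none of these names come from ARM's pseudocode):

```python
# sketch of the CurrentVL selection quoted above; the argument names
# are hypothetical stand-ins for HaveSME(), PSTATE.SM, SVL and NVL.

def current_vl(have_sme, streaming_mode, svl, nvl):
    """pick SVL if SME is present and streaming mode is on, else NVL."""
    return svl if (have_sme and streaming_mode) else nvl


print(current_vl(True, True, 512, 128))   # SME + streaming mode: 512
print(current_vl(True, False, 512, 128))  # SME, not streaming: 128
print(current_vl(False, True, 512, 128))  # no SME at all: 128
```

the 512 and 128 are just example lengths, picked to show that the two modes can report different values.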
SVL takes you through to something called "Streaming SVL",
which leads to looking up SMCR_EL1/2/3:
https://developer.arm.com/documentation/ddi0601/2022-03/AArch64-Registers/SMCR-EL2--SME-Control-Register--EL2-?lang=en
Constrains the effective Streaming SVE vector register length
for EL2, EL1, and EL0 to (LEN+1)*128 bits
ahhhhh, there we have it: multiples of 128 bits. ta-daaa.
likewise for that NVL thing, that tracks through to ZCR_EL2
https://developer.arm.com/documentation/ddi0595/2020-12/AArch64-Registers/ZCR-EL2--SVE-Control-Register--EL2-
Constrains the effective scalable vector register length
for EL2, EL1, and EL0 to (LEN+1)x128 bits
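a quick way to see what "(LEN+1)*128 bits" allows is to enumerate it. the python sketch below assumes LEN is a 4-bit field (0..15); the helper names are made up. it only lists what the field can request, it does not say which of those lengths a given implementation actually supports.

```python
# sketch of the "(LEN+1)*128 bits" constraint quoted above.
# assumption: LEN is a 4-bit field, so it ranges over 0..15.
# constrained_vl_bits and is_power_of_two are hypothetical helpers.

def constrained_vl_bits(length_field):
    if not 0 <= length_field <= 15:
        raise ValueError("LEN assumed to be a 4-bit field")
    return (length_field + 1) * 128


def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


for length_field in range(16):
    bits = constrained_vl_bits(length_field)
    note = "power of two" if is_power_of_two(bits) else "multiple of 128 only"
    print(f"LEN={length_field:2d} -> {bits:4d} bits ({note})")
```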
ok so my brain was clearly melted by the pseudocode on initial
glances but a day after getting over the shock it seems readable :)
conclusion: the tile sizes are power-of-two boundaried, depending
on the silicon-partner's choice of vector size (128..2048 in steps
of 128, LEN being a 4-bit field). tiles have to be square, so they
are power-of-two boundaried as well: a silicon-partner choice of 128
would result in tiles being 2x2 for 64-bit operations and 4x4 for
32-bit operations.
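the tile arithmetic is just "dim = VL DIV esize" from the pseudocode. a small python sketch that tabulates it for a few example vector lengths (tile_dim is a hypothetical helper name, and the lengths are picked purely for illustration):

```python
# sketch of "dim = VL DIV esize" from the pseudocode, tabulated for
# a few vector lengths. tile_dim is a hypothetical helper name.

def tile_dim(vl_bits, esize_bits):
    """number of rows (and columns, tiles being square) in one tile."""
    return vl_bits // esize_bits


for vl in (128, 256, 512):
    for esize in (32, 64):
        dim = tile_dim(vl, esize)
        print(f"VL={vl:4d} esize={esize}: tile is {dim}x{dim}")
```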
thank you for waking me up to the two predicate-mask sources, dmcq,
i missed what was right in front of my nose, yesterday.
l.