ARM Scalable Matrix Extension

By: lkcl, July 29, 2022 3:44 pm
Room: Moderated Discussions
dmcq on July 29, 2022 2:24 pm wrote:
> lkcl on July 28, 2022 3:38 pm wrote:
> As far as I can see they have two predicates controlling which elements
> are active in a tile so yes they should be able to do that sort of thing.

okaay, so the actual hardware size is power-2 boundaries and masks
can be arbitrary?

<Pn> Is the name of the first governing scalable predicate register P0-P7,
encoded in the "Pn" field.
<Pm> Is the name of the second governing scalable predicate register P0-P7,
encoded in the "Pm" field.

then, both predicate n and m bits have to be true for the fmadd/fmsub
to take place.

FMOPA seems to be the same:

if ElemP[mask1, row, esize] == '1' &&
   ElemP[mask2, col, esize] == '1' then
    Elem[result, row*dim+col, esize] = FPMulAdd_ZA(element3, element1, element2, FPCR[]);
else
    Elem[result, row*dim+col, esize] = element3;

hang on... that's still writing results out. which would mean rather
that the masks would be useful to detect zeros, in advance of performing
the outer-product.

(on the reasonable basis that doing result += 0.0*M or += N*0.0 is
wasting CPU cycles).

so i would expect a "pre-zero-detection-phase" to be run prior to
calling this instruction.
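to make the two-predicate behaviour concrete, here is a rough python sketch of the quoted pseudocode, plus the kind of "pre-zero-detection" pass speculated about above. the names fmopa and nonzero_mask are made up for illustration, plain python floats stand in for the fused FPMulAdd_ZA, and nothing here is claimed to match what real hardware or real SME code does:

```python
def fmopa(za, zn, zm, pn, pm):
    """Predicated outer-product accumulate on a square tile.

    za: dim x dim accumulator tile (list of lists)
    zn, zm: source vectors of length dim
    pn, pm: row and column predicates (lists of bool)
    """
    dim = len(zn)
    out = [row[:] for row in za]
    for row in range(dim):
        for col in range(dim):
            if pn[row] and pm[col]:
                # not fused here; the spec uses a fused multiply-add
                out[row][col] = za[row][col] + zn[row] * zm[col]
            # else: the element is written back unchanged
    return out


def nonzero_mask(vec):
    """Hypothetical pre-pass: mark elements that are worth multiplying."""
    return [x != 0.0 for x in vec]


zn = [1.0, 0.0, 2.0, 3.0]
zm = [0.0, 4.0, 5.0, 0.0]
za = [[0.0] * 4 for _ in range(4)]

masked = fmopa(za, zn, zm, nonzero_mask(zn), nonzero_mask(zm))
unmasked = fmopa(za, zn, zm, [True] * 4, [True] * 4)

# arbitrary (non-power-of-two) active region: first 3 rows only
partial = fmopa(za, zn, zm, [True, True, True, False], [True] * 4)
```

for finite inputs the masked and unmasked results agree, which is the whole point of skipping the zeros. with infinities or NaNs in the other operand they would not (0.0*inf is NaN), so such a pre-pass would only be safe under that assumption.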

which still does not entirrrely determine whether VL can be non-power-of-two

constant integer esize = 32; # 64 for double-precision
constant integer VL = CurrentVL;
constant integer PL = VL DIV 8;
constant integer dim = VL DIV esize;

for row = 0 to dim-1
    for col = 0 to dim-1

it all looks very odd, to me - row/col dimensions fixed at the same
size? if you click on the link CurrentVL it takes you here:

integer CurrentVL
    return if HaveSME() && PSTATE.SM == '1' then SVL else NVL;

err... now you have to track both/either SVL and NVL...
SVL takes you through to something called "Streaming SVL"
and looking up SMCR_EL1/2/3

Constrains the effective Streaming SVE vector register length
for EL2, EL1, and EL0 to (LEN+1)*128 bits

ahhhhh, there we have it: multiples of 128 bits. ta-daaa.
likewise for that NVL thing, that tracks through to ZCR_EL2

Constrains the effective scalable vector register length
for EL2, EL1, and EL0 to (LEN+1)x128 bits
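a small python sketch of what those two quotes and CurrentVL amount to. the function names are invented here, and the 4-bit width of LEN is an assumption on my part; an implementation may also support fewer lengths than the field can express:

```python
def effective_vl_bits(len_field: int) -> int:
    """Vector length implied by a LEN value: (LEN+1)*128 bits."""
    if not 0 <= len_field <= 15:  # assumption: LEN is a 4-bit field
        raise ValueError("LEN out of range")
    return (len_field + 1) * 128


def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


def current_vl(have_sme: bool, streaming: bool, svl: int, nvl: int) -> int:
    """Mirrors the quoted CurrentVL: SVL in streaming mode, otherwise NVL."""
    return svl if have_sme and streaming else nvl


all_lengths = [effective_vl_bits(n) for n in range(16)]
pow2_lengths = [vl for vl in all_lengths if is_power_of_two(vl)]

print(all_lengths)   # every multiple of 128 the field can express
print(pow2_lengths)  # the subset that is also a power of two
```

so the LEN encoding by itself allows lengths like 384 or 640 that are multiples of 128 but not powers of two; whether those are actually permitted is a separate architectural rule, not something (LEN+1)*128 tells you.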

ok so my brain was clearly melted by the pseudocode on initial
glances but a day after getting over the shock it seems readable :)

conclusion: the tile sizes are power-of-two boundaried, depending
on the silicon-partner's choice of vector size. the LEN field alone
only says multiples of 128 bits, but as far as i can tell the SME spec
restricts SVL itself to powers of two (128..2048), meaning that tiles
(which have to be square) are also power-of-two boundaried. a
silicon-partner choice of 128 would result in tiles being 2x2 for
64-bit operations and 4x4 for 32-bit operations.
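the tile arithmetic is just dim = VL DIV esize from the pseudocode. a quick python check (tile_dim is a name made up for this sketch):

```python
def tile_dim(vl_bits: int, esize_bits: int) -> int:
    """Rows (and columns) of a square tile: dim = VL DIV esize."""
    if vl_bits % esize_bits != 0:
        raise ValueError("vector length must be a multiple of the element size")
    return vl_bits // esize_bits


for vl in (128, 256, 512):
    for esize in (32, 64):
        d = tile_dim(vl, esize)
        print(f"VL={vl:4d} esize={esize}: tile is {d}x{d}")
```

note that tile_dim(384, 32) would give 12, so the dim formula on its own does not force power-of-two tiles; that property has to come from whatever restricts the vector length.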

thank you for waking me up to the two predicate-mask sources, dmcq,
i missed what was right in front of my nose, yesterday.
