By: --- (---.delete@this.redheron.com), May 12, 2022 12:32 am
Room: Moderated Discussions
The investigation is not yet complete, but I've concluded that Apple has very interesting ambitions for AMX, way beyond what people imagine today.
The first round of patents describes basic AMX the way we think of it today: as an outer product engine that can be used to perform fast matrix multiply. Nice, but rather specialized.
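(For anyone who hasn't internalized the framing: one outer product instruction performs a full rank-1 update, Z += x*y^T, and a matrix multiply is just K such updates accumulated in sequence. A toy C sketch of the idea -- tile size and function names are mine for illustration, not Apple's actual ISA:)

```c
#include <stdio.h>

#define N 4  /* illustrative tile size; the real engine is larger */

/* One "AMX-style" step: a rank-1 (outer product) update, Z += x * y^T. */
static void outer_product_accumulate(float z[N][N], const float x[N], const float y[N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            z[i][j] += x[i] * y[j];  /* N*N MACs from a single instruction's worth of work */
}

/* A full matrix multiply C = A*B is just K such updates: C = sum_k A[:,k] * B[k,:]. */
static void matmul_via_outer_products(float c[N][N], float a[N][N], float b[N][N]) {
    for (int k = 0; k < N; k++) {
        float col[N], row[N];
        for (int i = 0; i < N; i++) col[i] = a[i][k];  /* column k of A -> X register */
        for (int j = 0; j < N; j++) row[j] = b[k][j];  /* row k of B    -> Y register */
        outer_product_accumulate(c, col, row);
    }
}

int main(void) {
    float a[N][N] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
    float b[N][N] = {{1,0,0,0},{0,1,0,0},{0,0,1,0},{0,0,0,1}};  /* identity */
    float c[N][N] = {0};  /* Z starts cleared, like a zeroed tile register */
    matmul_via_outer_products(c, a, b);
    printf("c[1][2] = %g (expect 7, since B is the identity)\n", c[1][2]);
    return 0;
}
```

The point is the ratio: one instruction, N*N MACs, with essentially no data rearrangement -- which is exactly the "throughput, not permuting" character I come back to below.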

- vectors can be fp64, fp32, fp16, or int8
- vector elements can even be very narrow (1, 2, or 4 bits), routed through a lookup table. Very specialized ML stuff -- for now
- the vector that is used for the outer product can be specified in weird ways. Essentially, imagine the pool of 8 512b X or Y vector registers as a single 8*64B very long X or Y register. Then you can specify a source as starting at an arbitrary byte offset within this long vector and proceeding with some stride (sketched just after this list)
- masks
- you can even pack 2x2 matrices into these vectors for an alternate outer product mode (good for smaller matrices)
So lots of weird stuff, and easy to dismiss as basically strange requests by the ML team.
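To make that strided-source bullet concrete, here's a hedged C model: treat the X register pool as one flat byte array and pull a 512b fp32 operand out of it given a start offset and a stride. The descriptor fields and the wrap-around behavior are my guesses, not the documented encoding:

```c
#include <stdint.h>
#include <string.h>

/* Model the pool of 8 x 512b X registers as one flat 512-byte array. */
#define X_POOL_BYTES (8 * 64)

typedef struct { uint8_t bytes[X_POOL_BYTES]; } x_pool_t;

/* Hypothetical operand descriptor: a 512b fp32 operand (16 lanes x 4 bytes)
   gathered from anywhere in the pool, stepping stride_bytes per lane.
   The modulo wrap at the end of the pool is a guess at the edge behavior. */
static void gather_operand_fp32(const x_pool_t *pool, uint32_t start_byte,
                                uint32_t stride_bytes, float out[16]) {
    for (int lane = 0; lane < 16; lane++) {
        uint8_t tmp[4];
        for (uint32_t b = 0; b < 4; b++)
            tmp[b] = pool->bytes[(start_byte + lane * stride_bytes + b) % X_POOL_BYTES];
        memcpy(&out[lane], tmp, sizeof(float));
    }
}
```

Note that stride_bytes = 4 with a 64-aligned start_byte degenerates to a plain contiguous read of one classic X register, so the old behavior falls out as a special case.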
But the third round of patents include features like
- vector-vector operations (which will be 512b wide)
- extracting vectors from the Z result register
- executing multiple vector-vector instructions in parallel (as long as they write to different rows of the Z matrix; sketched below)
- a new concern with the efficiency of the engines (which, remember, live at the per-cluster level) in the face of aggressive simultaneous use by multiple cores.
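A toy model of that dual-issue rule, as I read the patents -- the row-conflict check here is my paraphrase, not a documented mechanism:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the parallel-issue constraint hinted at in the patents:
   two vector-vector ops can go down the pipes together only if they
   target different rows of the Z register. */
typedef struct {
    uint8_t dst_z_row;   /* which Z row this op writes */
    /* ... opcode, source operand descriptors, etc. ... */
} vec_op_t;

static bool can_dual_issue(const vec_op_t *a, const vec_op_t *b) {
    /* No write-after-write hazard on Z: different rows -> issue in parallel. */
    return a->dst_z_row != b->dst_z_row;
}
```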
To me the future seems clear: a somewhat interesting bifurcation.
While Intel put the full range of "vector" operations into AVX512, Apple seems to have in mind something of a split:
- NEON/SVE2 as the "latency/permuting" instruction set, probably only 256b, but able to slice and dice data within and between registers to your heart's content (an example of that kind of permute work follows below)
- AMX as the "throughput" instruction set for matrix and "vertical" vector operations, basic stuff, few rearrangement options, but able to give you as many MACs as you wish.
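The kind of work that stays on the NEON side is cheap lane rearrangement. A minimal illustration using the standard arm_neon.h intrinsics (nothing AMX-specific here; this compiles with any AArch64 toolchain):

```c
#include <arm_neon.h>

/* The sort of "latency/permuting" work that stays on the NEON side:
   interleave two vectors lane by lane -- cheap rearrangement that a
   throughput-oriented outer-product engine has no reason to offer. */
static float32x4_t interleave_low(float32x4_t a, float32x4_t b) {
    /* zip1 interleaves the low halves of a and b:
       result = { a[0], b[0], a[1], b[1] } */
    return vzip1q_f32(a, b);
}
```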
Don't be surprised (I won't be) if AMX gets another shout-out at either WWDC or the September event...
I'm also now more sympathetic to Apple's current "keep everything behind APIs" stance than I used to be. It's clear that every year they are aggressively rethinking what they can do. The ISA has already become rather messy between generations one and two, let alone generation three. As far as I can tell it's still the same ISA (creaking at the seams), but I expect at some point there will be a break, and two very different versions of Accelerate will ship for the two different ISAs. I can now see why they don't want to be held back by compatibility issues; so many options are opening up even with the current hardware, ways to get a lot more from just a minor tweak.
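This is also why the sanctioned path is Accelerate rather than raw AMX instructions: you call the public API and Apple remains free to retarget it every generation. For example, the standard CBLAS entry point that Accelerate exposes (this much is public API; whether any given call actually lands on AMX is Apple's business, not part of the contract):

```c
#include <Accelerate/Accelerate.h>
#include <stdio.h>

int main(void) {
    /* C = A * B via Accelerate's BLAS; Apple routes this to whatever
       hardware (AMX included) it sees fit, with no ISA exposure. */
    float a[2 * 2] = {1, 2, 3, 4};
    float b[2 * 2] = {5, 6, 7, 8};
    float c[2 * 2] = {0};
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,        /* M, N, K */
                1.0f, a, 2,     /* alpha, A, lda */
                b, 2,           /* B, ldb */
                0.0f, c, 2);    /* beta, C, ldc */
    printf("%g %g / %g %g\n", c[0], c[1], c[2], c[3]);  /* 19 22 / 43 50 */
    return 0;
}
```

(Build with clang foo.c -framework Accelerate.)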
And, BTW, while the implementation requires matrix ops to cycle through the hardware four times on the E-cores, the vector-vector ops do not have to do this. In fact the E-cores (at least to judge by the patents, though this may be at the A15 or even later level, not A14/M1) are capable of two full 512b vector-vector ops per cycle. An interesting contrast to Intel's ongoing will-they-won't-they vacillations around Alder Lake and AVX512 (which one suspects will only get worse in terms of whether Intel's AMX is shared across the line or kept for the high-end).
(Oh, and as another thing that may one day get a dedicated shout-out -- did you know that Apple has its own VPU à la Movidius!? It clearly started as adding some smarts to the existing ISP; it has now very clearly branched off into its own little world.)