By: --- (---.delete@this.redheron.com), May 13, 2022 9:33 am
Room: Moderated Discussions
Matt Lohmann (mlohmann.delete@this.noemail.com) on May 13, 2022 5:21 am wrote:
> > In fact the E cores (at least to judge by the patents, though this may be at the A15 or
> > even later level, not A14/M1) are capable of two full 512b vector-vector ops per cycle.
>
> Is the performance per clock cycle of the per-cluster vector engine for the E cores different from
> the performance per clock cycle of the per-cluster vector engine for the P cores in the Apple M1?
>
> How many 512-bit vector arithmetic operations per clock cycle can be performed
> by the per-cluster vector engine for the P cores in the Apple M1?
>
> Can the per-cluster vector engines in the Apple M1 overlap
> 512-bit loads and stores with arithmetic operations?
>
> Are fp64 fused multiply-adds (a * b + c) supported in either the
> per-cluster vector engines or the CPU cores in the Apple M1?
>
> Are 256-bit vector operations performed in each P core or only in the
> shared vector engine for a cluster of P cores in the Apple M1?
>
> Thank you in advance to anyone who can answer any of these questions.
You're asking questions about a direction that is indicated by the patents, and suggested by what we have seen or tested of AMX, but no more than that. As such you can't expect answers that are any more than guesses.
The baseline P AMX engine is
4 vertically stacked super-PEs.
Each super-PE is a rectangle of 8*2 processing elements (PEs).
So this gives you an array of 8*8 PEs.
The E AMX is (good enough) a single such super-PE, so an array of 8*2 PEs.
A PE is a single unit that can perform an fp64 multiply-add, or 2 fp32 MACs, or (I think) four fp16 MACs, or four int16 or int8 MACs (plus some associated storage).
So if you do the arithmetic, the E AMX has
- two rows of MACs
- each MAC can handle 64b, and there are 8 per row, so each row is 512b wide.
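To put numbers on that arithmetic, here's a minimal back-of-the-envelope sketch of the peak throughput implied by those PE counts. The 3.2 GHz clock is purely an assumed placeholder (the real AMX clock isn't public), and the flop counts just follow the per-PE capabilities above:

    #include <stdio.h>

    int main(void) {
        const int p_pes = 8 * 8;  // P AMX: 4 super-PEs of 8*2 PEs each
        const int e_pes = 8 * 2;  // E AMX: 1 super-PE
        const double ghz = 3.2;   // assumed clock, for illustration only

        // 1 fp64 FMA = 2 flops/PE/cycle; 2 fp32 MACs = 4 flops/PE/cycle.
        printf("P AMX peak: ~%.0f fp64 / ~%.0f fp32 GFLOPS\n",
               p_pes * 2 * ghz, p_pes * 4 * ghz);
        printf("E AMX peak: ~%.0f fp64 / ~%.0f fp32 GFLOPS\n",
               e_pes * 2 * ghz, e_pes * 4 * ghz);
        return 0;
    }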
The AMX engine is right now very much set up as a coprocessor, one engine per cluster.
NEON (built into each core) uses 128b registers, and there are 4 NEON pipes per P core and 2 per E core, so (in a handwaving sense) 512b of NEON capability per cycle per P core and 256b per cycle per E core.
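To see what those four pipes mean in practice, here is a minimal sketch (plain arm_neon.h intrinsics; the function name and the unroll factor of four are mine) that gives the scheduler four independent 128b FMA chains, one per P-core pipe in the handwaving model above:

    #include <arm_neon.h>

    // Four independent fp32 accumulator chains; n assumed a multiple of 16.
    float dot4(const float *a, const float *b, int n) {
        float32x4_t acc0 = vdupq_n_f32(0.0f), acc1 = vdupq_n_f32(0.0f);
        float32x4_t acc2 = vdupq_n_f32(0.0f), acc3 = vdupq_n_f32(0.0f);
        for (int i = 0; i < n; i += 16) {
            acc0 = vfmaq_f32(acc0, vld1q_f32(a + i),      vld1q_f32(b + i));
            acc1 = vfmaq_f32(acc1, vld1q_f32(a + i + 4),  vld1q_f32(b + i + 4));
            acc2 = vfmaq_f32(acc2, vld1q_f32(a + i + 8),  vld1q_f32(b + i + 8));
            acc3 = vfmaq_f32(acc3, vld1q_f32(a + i + 12), vld1q_f32(b + i + 12));
        }
        // Reduce the four chains to a scalar.
        return vaddvq_f32(vaddq_f32(vaddq_f32(acc0, acc1),
                                    vaddq_f32(acc2, acc3)));
    }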
SVE2 is *probably* coming to Apple this year with A16 and M2, and will *probably* feature 256b wide registers. But if you are evaluating SVE2 based on register width, you're misunderstanding where the value of SVE2 lies.
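As an illustration of where that value lies, the same loop in standard SVE2 intrinsics (arm_sve.h; nothing Apple-specific, and the function name is mine) is vector-length agnostic: one binary runs unchanged whether the hardware registers are 128b, 256b, or wider, and predication absorbs the loop tail with no scalar cleanup:

    #include <arm_sve.h>

    float dot_sve(const float *a, const float *b, int64_t n) {
        svfloat32_t acc = svdup_f32(0.0f);
        for (int64_t i = 0; i < n; i += svcntw()) {  // svcntw = fp32 lanes/register
            svbool_t pg = svwhilelt_b32(i, n);       // mask off past-end lanes
            acc = svmla_f32_m(pg, acc, svld1_f32(pg, a + i), svld1_f32(pg, b + i));
        }
        return svaddv_f32(svptrue_b32(), acc);       // horizontal reduction
    }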
AMX (possibly in the M1 version, probably in the A15 version) appears capable of executing two operations per cycle, one load/store and one compute. But the compute operation may be a "fused" operation, which is two independent vector operations that output results to independent rows of the array of PEs.
HOWEVER that's not as good as it sounds. Remember that AMX instructions execute on a core through the LSU, specifically through the store pipes. A P core has two store pipes, an E core has one. That means that, as throughput, P can only pump out two AMX instructions per cycle and E only one. Right now this is probably the biggest throughput weakness in the design, and something I expect to evolve soon. (An easy quick improvement would be to move the P core from the current design
- 2 load pipe
- 1 store pipe
- 1 ambidextrous pipe
to
- 1 load pipe
- 1 store pipe
- 2 ambidextrous pipes
which gives a possible throughput of 3 stores/AMX per cycle.)
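Putting the issue side next to the engine side makes the bottleneck, and the fix, explicit. The numbers below come straight from the two paragraphs above; note that a cluster has several cores that can all feed the shared engine, so single-core issue isn't the whole story:

    #include <stdio.h>

    int main(void) {
        const int engine_ops = 2;   // 1 load/store + 1 compute per cycle
        const int p_issue    = 2;   // store-capable pipes on today's P core
        const int p_proposed = 3;   // 1 store + 2 ambidextrous pipes
        const int e_issue    = 1;   // single store pipe on an E core

        printf("P today:    %d issued vs %d consumed -> zero headroom\n",
               p_issue, engine_ops);
        printf("P proposed: %d issued vs %d consumed -> headroom\n",
               p_proposed, engine_ops);
        printf("E (1 core): %d issued vs %d consumed -> issue-bound\n",
               e_issue, engine_ops);
        return 0;
    }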
Transport of instructions from a core to the AMX unit is less of a concern because AMX operations are bundled, as many as will fit, into the wide bus between a core and L2. That bundling number appears to be around 6 or 7 (some can build up in a queue if that bus is being used for other L1/L2 transactions).
But, like I said, these are inappropriate questions right now.
If you want to understand the scheme and the direction, look at the patents; e.g.
a recent one (2019) with a good overview is https://patents.google.com/patent/US20200272597A1
If you are planning to build something huge and need this data, talk to Apple.
If you're simply curious, well, wait for WWDC then the September event!