By: Matt Lohmann (mlohmann.delete@this.noemail.com), May 13, 2022 1:32 pm
Room: Moderated Discussions
> NEON (built into each core) uses 128b registers, and there are 4 NEON engines (so, in a handwaving
> sense) 512b of NEON capability per cycle per P core, 256b of NEON capability per cycle per E core.
Since the same code can run on either a P core or E core, does the E core have registers for 512b of NEON capability but just the execution resources for 256b of NEON capability per cycle?
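For reference, here is the arithmetic behind the quoted figures as I read them, as a small Python sketch. The constant and function names are mine, and the engine count for the E core is only what the 256b total would imply if each engine is 128b wide; the quote does not state it.

```python
# Figures taken from the quoted post: 128b NEON registers, 4 engines per P core,
# 256b of NEON capability per cycle per E core.
NEON_REGISTER_BITS = 128
P_CORE_NEON_ENGINES = 4
E_CORE_BITS_PER_CYCLE = 256


def neon_bits_per_cycle(engines, register_bits=NEON_REGISTER_BITS):
    """Peak NEON width per cycle if every engine issues one full-width op."""
    return engines * register_bits


p_core_bits = neon_bits_per_cycle(P_CORE_NEON_ENGINES)

# Assumption (not stated in the quote): E core engines are also 128b wide,
# so the 256b figure would correspond to this many engines.
implied_e_core_engines = E_CORE_BITS_PER_CYCLE // NEON_REGISTER_BITS

print(p_core_bits, implied_e_core_engines)
```

Read this way, the 512b and 256b numbers describe how many 128b operations can issue per cycle, not the width of any single register.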
What is the benefit of the AMX engine in the P core cluster given that each P core has 512b of NEON capability per cycle? Is the benefit of AMX in the P core cluster two operations per cycle versus one 512b NEON operation per cycle?
It seems to me that it must be very difficult for a compiler to decide if some vector operation should be done in NEON or AMX. AMX is shared between 4 P cores and the 512b of NEON capability is inside each P core. The compiler needs to know what the other 3 P cores in a cluster are doing to know if it is better to use AMX or NEON.
> AMX (possibly in the M1 version, probably in the A15 version) appears capable of executing two operations
> per cycle, one load/store and one compute. But the compute operation may be a "fused" operation which
> is two independent vector operations that output results to independent rows of the array of PE's.
In what sense are the two independent vector operations “fused” if their outputs go to different places? That sounds like completely independent vector operations rather than something like a * b + c.
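To make the question concrete, here is a toy Python sketch of the distinction as I understand it. The function names and the list-of-rows stand-in for the PE array are my own invention for illustration; nothing here reflects how AMX is actually programmed.

```python
def fused_multiply_add(a, b, c):
    """What I normally think of as fused: one op, one destination, a*b + c per lane."""
    return [x * y + z for x, y, z in zip(a, b, c)]


def paired_independent_ops(array, row0, row1, a, b, c, d):
    """Two unrelated vector ops issued together (hypothetical model).

    The results do not feed each other and land in different rows.
    """
    array[row0] = [x + y for x, y in zip(a, b)]  # first op: a + b
    array[row1] = [x * y for x, y in zip(c, d)]  # second op: c * d
    return array


fma_result = fused_multiply_add([1, 2], [3, 4], [5, 6])

pe_rows = [[0, 0], [0, 0], [0, 0]]  # stand-in for rows of the PE array
paired_independent_ops(pe_rows, 0, 2, [1, 2], [3, 4], [5, 6], [7, 8])

print(fma_result, pe_rows)
```

In the first function the multiply and the add share data and a destination; in the second the only thing the two operations share is the issue slot, which is why "fused" seems like an odd word for it.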