By: --- (---.delete@this.redheron.com), May 13, 2022 2:09 pm
Room: Moderated Discussions
Matt Lohmann (mlohmann.delete@this.noemail.com) on May 13, 2022 1:32 pm wrote:
> > NEON (built into each core) uses 128b registers, and there are 4 NEON engines (so, in a handwaving
> > sense) 512b of NEON capability per cycle per P core, 256b of NEON capability per cycle per E core.
>
> Since the same code can run on either a P core or E core, does the E core have registers for 512b
> of NEON capability but just the execution resources for 256b of NEON capability per cycle?
>
> What is the benefit of the AMX engine in the P core cluster given that each P
> core has 512b of NEON capability per cycle? Is the benefit of AMX in the P core
> cluster two operations per cycle versus one 512b NEON operation per cycle?
>
> It seems to me that it must be very difficult for a compiler to decide if some vector
> operation should be done in NEON or AMX. AMX is shared between 4 P cores and the 512b
> of NEON capability is inside each P core. The compiler needs to know what the other
> 3 P cores in a cluster are doing to know if it is better to use AMX or NEON.
>
> > AMX (possibly in the M1 version, probably in the A15 version) appears capable of executing two operations
> > per cycle, one load/store and one compute. But the compute operation may be a "fused" operation which
> > is two independent vector operations that output results to independent rows of the array of PE's.
>
> In what sense are the two independent vector operations “fused” if their outputs go to different places?
> That sounds like completely independent vector operations rather than something like a * b + c.
These are very basic questions and, excuse me for being blunt, answering them is not a good use of my limited time.
You are better off reading various intro material on the web. A search for Apple AMX will, for example, bring up a lot of material.
If you want to know full details (of AMX and much more besides) go read my three volumes at
https://github.com/name99-org/AArch64-Explore
(the three PDFs).
Volume 3 (the version at github right now) has an AMX section, limited, but enough to cover your basic questions. In a month or three I'll publish a new version which, among other things, includes the newest AMX stuff I have discovered.