By: Anon1 (anon.delete@this.anon.com), May 13, 2022 3:55 pm
Room: Moderated Discussions
Matt Lohmann (mlohmann.delete@this.noemail.com) on May 13, 2022 1:32 pm wrote:
> Since the same code can run on either a P core or E core, does the E core have registers for 512b
> of NEON capability but just the execution resources for 256b of NEON capability per cycle?
NEON defines 32 128-bit architectural registers. Everything beyond that is a microarchitectural detail.
> What is the benefit of the AMX engine in the P core cluster given that each P
> core has 512b of NEON capability per cycle? Is the benefit of AMX in the P core
> cluster two operations per cycle versus one 512b NEON operation per cycle?
AMX has higher throughput for certain operations. That’s why it’s there.
> It seems to me that it must be very difficult for a compiler to decide if some vector
> operation should be done in NEON or AMX. AMX is shared between 4 P cores and the 512b
> of NEON capability is inside each P core. The compiler needs to know what the other
> 3 P cores in a cluster are doing to know if it is better to use AMX or NEON.
>
The compiler doesn’t decide anything, because it doesn’t generate AMX instructions. Apple does not expose those instructions directly; they are a private implementation detail. You can reach the AMX coprocessor through Apple’s numerical libraries.
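To make the "use the libraries" point concrete, here is a minimal sketch of a matrix multiply that goes through the CBLAS interface in Apple's Accelerate framework. The helper name multiply is hypothetical, and the non-Apple branch is only a naive stand-in so the snippet compiles elsewhere. Whether a particular call actually lands on the AMX unit is Apple's decision inside the library; that dispatch is an assumption here, not something the API documents or that the caller controls.

```c
#ifdef __APPLE__
/* Apple's Accelerate framework exposes the standard CBLAS interface.
   Build with: cc example.c -framework Accelerate */
#include <Accelerate/Accelerate.h>
#endif

/* Hypothetical helper: C = A * B, all matrices row-major.
   A is m x k, B is k x n, C is m x n. */
void multiply(const float *a, const float *b, float *c, int m, int n, int k)
{
#ifdef __APPLE__
    /* The library picks the execution unit. The caller only states the
       math: C = 1.0 * A * B + 0.0 * C. Leading dimensions are the row
       lengths because the layout is row-major. */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0f, a, k,
                b, n,
                0.0f, c, n);
#else
    /* Portability stand-in (assumption, not an Apple API): a plain
       triple loop computing the same product. */
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int p = 0; p < k; p++)
                sum += a[i * k + p] * b[p * n + j];
            c[i * n + j] = sum;
        }
    }
#endif
}
```

Calling multiply(a, b, c, 2, 2, 2) on two 2x2 arrays fills c with their product. Note that nothing in the source mentions NEON or AMX: the application code is the same either way, which is why the compiler never has to choose between them.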