By: --- (---.delete@this.redheron.com), May 13, 2022 9:03 pm
Room: Moderated Discussions
Matt Lomann (mlohmann.delete@this.noemail.com) on May 13, 2022 7:02 pm wrote:
> Thank you. That makes sense. The complexity of these modern processors is shocking. It’s a
> miracle they ever work at all. Hiding the AMX instructions behind an API allows Apple to fix
> some hardware bugs with software, such as by not using a particular sequence of instructions.
Bugs are always possible, but a more likely reason for hiding the direct API is, as I have already said, that Apple does not want to be slowed down in the evolution of AMX.
It's already clear that there are limitations to the existing instruction design (the way to specify extracting an arbitrary vector from the set of vector registers was added later and is definitely not ideal). My guess is they will continue with an opaque ISA, and somewhat clumsy retrofits of new functionality, until the ISA can take it no further; then they'll switch to a new ISA (essentially the same functionality, same design, just a better set of instruction primitives). I see no reason why they would want to keep the instructions secret forever (what's in it for them?), but it certainly makes sense while the functionality is rapidly evolving every year...
> The AMX engine must run at the same clock frequency as the 4 P cores or 2 E cores it is connected to. If one
> P core starts doing a lot of AMX operations, I wonder if the clock frequency of the remaining 3 P cores gets
> reduced, sort of like the clock frequency gets reduced when using AVX512 instructions on a Xeon processor.
Yes, everything in a cluster runs at the same frequency.
There is no reason to believe that AMX reduces the frequency noticeably below what you already get when all 4 P cores are running. No one has reported this, and a rough estimate of AMX performance (number of MACs, size of problem, blah blah) suggests that if frequency is reduced, it is by something minimal, not worth worrying about.
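For anyone who wants to redo that kind of rough estimate themselves, here is a minimal sketch of the arithmetic. Everything in it is an assumption for illustration: the function name, the idea of timing a plain matrix multiply, and above all `macs_per_cycle`, which you have to supply from your own understanding of the hardware (no figure for it is given here). It only converts a timed problem of known size into the clock frequency that timing would imply if the unit were kept fully busy.

```python
def implied_clock_ghz(m, n, k, elapsed_s, macs_per_cycle):
    """Clock (in GHz) implied by an (m x k) times (k x n) matrix multiply
    that took elapsed_s seconds, assuming the unit retires macs_per_cycle
    multiply-accumulates every cycle with no stalls.

    All inputs are supplied by the caller; macs_per_cycle in particular is
    a hypothetical parameter, not a documented hardware figure.
    """
    if elapsed_s <= 0 or macs_per_cycle <= 0:
        raise ValueError("elapsed_s and macs_per_cycle must be positive")
    total_macs = m * n * k              # one MAC per (i, j, l) triple
    cycles = total_macs / macs_per_cycle
    return cycles / elapsed_s / 1e9     # cycles per second, scaled to GHz


def relative_slowdown(ghz_alone, ghz_loaded):
    """Fractional drop in implied clock between two runs of the same problem."""
    return (ghz_alone - ghz_loaded) / ghz_alone
```

The comparison that matters is between two runs of the same problem: one with the other P cores idle, one with them busy. If `relative_slowdown` on the two implied clocks comes out at a few percent or less, that is the "something minimal" case. Since real code never keeps the unit 100% busy, the absolute number will read low; only the ratio between runs is meaningful.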
As I've said before, both AMX and NEON accept that they are used primarily in throughput tasks and don't try for aggressive latency reduction (not just no 5.5GHz, but also every NEON op takes at least two cycles; I think the minimum for AMX might be four cycles). This means smaller transistors being driven a lot less hard, so much less localized heating.