By: --- (---.delete@this.redheron.com), May 14, 2022 12:06 pm
Room: Moderated Discussions
Simon Farnsworth (simon.delete@this.farnz.org.uk) on May 14, 2022 6:11 am wrote:
> Matt Lomann (mlohmann.delete@this.noemail.com) on May 13, 2022 7:02 pm wrote:
> > Thank you. That makes sense. The complexity of these modern processors is shocking. It’s a
> > miracle they ever work at all. Hiding the AMX instructions behind an API allows Apple to fix
> > some hardware bugs with software, such as by not using a particular sequence of instructions.
> >
> It also allows Apple to change the instruction mix over time - the operations
> are expected to be high latency anyway (function call, not instruction), and
> thus it's OK if Apple shifts functionality to and from software over time.
>
> It also lets Apple do the stunt that Transmeta is alleged to have done with CMS in the TM5800 series chips
> - you can have instruction sequences that are not allowed because they're too thermally expensive (thus induce
> surprise throttling), and Apple can simply make sure that the approved API doesn't run those sequences.
This is certainly possible in theory, but I don't think Apple care.
Apple have a magnificent Digital Power Estimator (DPE) mechanism across the entire SoC (with a patent specifically describing how the DPE was improved to better estimate the cost of AMX instructions). So it makes more sense to just issue the optimal instruction sequence and, if the DPE estimates a problem, let it do the necessary "throttling" by freezing the AMX clock every few cycles, holding switching activity right at the maximum allowed level and no higher.
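To make the clock-freezing idea concrete, here is a toy model in Python. It is only a sketch of the general technique (a credit-based limiter), not Apple's actual DPE logic: the function name gate_clock, the per-operation "cost" numbers, and the credit parameters are all hypothetical, made up for illustration.

```python
def gate_clock(op_costs, budget_per_cycle, max_credit):
    """Toy model of activity-based clock gating (hypothetical, not Apple's design).

    op_costs:         estimated switching cost of each queued operation
    budget_per_cycle: average cost allowed per cycle (the power limit)
    max_credit:       cap on unused allowance that can be saved up

    Returns one bool per cycle: True = clock runs and one operation issues,
    False = clock is frozen for that cycle.
    """
    if budget_per_cycle <= 0:
        raise ValueError("budget_per_cycle must be positive")
    if any(cost > max_credit for cost in op_costs):
        raise ValueError("an operation costs more than max_credit allows")

    credit = 0
    schedule = []
    next_op = 0
    while next_op < len(op_costs):
        # Each cycle adds allowance, up to the cap.
        credit = min(max_credit, credit + budget_per_cycle)
        if credit >= op_costs[next_op]:
            # Enough allowance: let the clock tick and issue the operation.
            credit -= op_costs[next_op]
            schedule.append(True)
            next_op += 1
        else:
            # Over the limit: freeze the clock for this cycle.
            schedule.append(False)
    return schedule


# Operations costing twice the per-cycle budget: the clock runs every other cycle.
print(gate_clock([4, 4, 4, 4], budget_per_cycle=2, max_credit=4))
# Operations within the budget: the clock is never frozen.
print(gate_clock([4, 4, 4, 4], budget_per_cycle=4, max_credit=4))
```

The point of the model is that software can always issue the best instruction sequence. An expensive sequence just runs with some cycles skipped, so average activity stays at the limit rather than above it.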
> More, those sequences can change from generation to generation, and even based on
> the cooling system in use, and Apple can simply account for this in software.
>
> > The AMX engine must run at the same clock frequency as
> > the 4 P cores or 2 E cores it is connected to. If one
> > P core starts doing a lot of AMX operations, I wonder if
> > the clock frequency of the remaining 3 P cores gets
> > reduced, sort of like the clock frequency gets reduced when using AVX512 instructions on a Xeon processor.
>
>