By: Tom Shaw (tom.shaw.delete@this.null.com),
Room: Moderated Discussions
Maynard Handley (name99.delete@this.name99.org) on June 5, 2025 10:36 am wrote:
> if we have one P-core active, it can apparently use no more than, say, 9MB of the 12MB L2.
Thank you for your comments. On page 106, Vol. 2 of your PDFs, it says one active P-core on M1 can use a maximum of 7.5MB out of 12MB of L2 cache. Is your current thinking that the maximum L2 for one P-core on M1 is 9MB, rather than 7.5MB, or are the experimental results not able to distinguish between a maximum L2 for one P-core of 7.5MB vs 9MB?
To get 9MB, each P-core could access what you called an inner L2 of 3MB plus the inner L2 of the two closest neighbors. If 4 P-cores are imagined to be organized as 2x2, each P-core would access horizontally or vertically adjacent inner L2, but not the inner L2 on the diagonal.
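To make the 9MB arithmetic concrete, here is a minimal sketch of the 2x2 arrangement described above. The grid layout, the 3MB slice size per core, and the name `reachable_l2_mb` are all assumptions for illustration, not anything confirmed about the hardware.

```python
# Hypothetical model: 4 P-cores in a 2x2 grid, each next to a 3 MB "inner" L2 slice.
INNER_L2_MB = 3
GRID = [(row, col) for row in range(2) for col in range(2)]


def reachable_l2_mb(core):
    """L2 one core could use: its own slice plus horizontally/vertically adjacent slices."""
    row, col = core
    # Manhattan distance 1 selects edge-adjacent neighbors and excludes the diagonal.
    neighbors = [(r, c) for (r, c) in GRID if abs(r - row) + abs(c - col) == 1]
    return INNER_L2_MB * (1 + len(neighbors))


for core in GRID:
    print(core, reachable_l2_mb(core), "MB of", INNER_L2_MB * len(GRID), "MB")
```

Under these assumptions every core reaches 9MB of the 12MB total, because each core in a 2x2 grid has exactly two edge-adjacent neighbors and one diagonal neighbor.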
For M3 Pro and M3 Max P-clusters where 6 P-cores share a 16MB L2, the floorplan indicates that 2 P-cores are close to 8MB of L2 (labeled cores 2 and 3 on M3 Pro) and 4 P-cores are close to the other 8MB of L2 (labeled cores 0, 1, 4 and 5 on M3 Pro). The P-cluster floorplan for M3 Max is exactly the same except there are two P-clusters instead of one. Do you have any guess about the inner L2 and outer L2 sizes for these P-clusters?

I agree with the point you made that having one active CPU core does not happen very often, outside of benchmarking single-threaded code.
> You can also do things like set replacement policy for a DSID to be LRU or MRU; or indicate that you want data
> in a DSID preserved in SLC as, say, every 4th line, rather than every line. There's a lot of control available
> and more keeps being added.
Saving every 4th cache line in SLC is an interesting feature. Perhaps the prefetch hardware fetches the next 3 cache lines from DRAM when a cache line is accessed.
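As a rough illustration of what I mean, the sketch below splits a run of consecutive cache lines into those kept in SLC and those that would have to come from DRAM. The 128-byte line size, the choice of which line in each group of four is kept, and the name `split_lines` are my assumptions, not documented behavior.

```python
LINE_BYTES = 128  # assumed cache line size; not stated in this thread
KEEP_EVERY = 4    # "every 4th line" from the quoted description


def split_lines(base_addr, n_lines, keep_every=KEEP_EVERY):
    """Return (kept, refetch) address lists for n_lines consecutive cache lines."""
    kept, refetch = [], []
    for i in range(n_lines):
        addr = base_addr + i * LINE_BYTES
        line_number = addr // LINE_BYTES
        # Assumption: the line whose number is a multiple of keep_every is the one preserved.
        if line_number % keep_every == 0:
            kept.append(addr)
        else:
            refetch.append(addr)
    return kept, refetch


kept, refetch = split_lines(base_addr=0, n_lines=8)
print("kept in SLC:", kept)
print("would need DRAM fetch:", refetch)
```

If something like this holds, a hit on a kept line would give the prefetcher a head start on the three lines that follow it.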
The text strings "Data Set ID", "Data Set Identifier", "Dataset ID" and "Dataset Identifier" do not appear anywhere on Apple's website. The only DSID mentioned on Apple's website is a "Destination Signaling Identifier", which is an identifier tied to an iCloud account. Is there some other name that could be used to find information about the features you described to control what is saved in the SLC?
> As far as CPUs are concerned, it happens at the L2 level, with lines being designated as critical,
> either statically (as described) or dynamically (basically a CPU core detects when certain loads
> result in the machine "clogging up" after the load misses, and detect such loads in the L1 as
> critical; this bit is then transferred to L2 and the lines preserved there).
> https://patents.google.com/patent/US20230060225A1 (mostly the L1 side)
> https://patents.google.com/patent/US12222875B1 (mostly the L2 side)
The feature of detecting "when certain loads result in the machine clogging up after the load misses, and detect such loads in the L1 as critical; this bit is then transferred to L2 and the lines preserved there" is impressive. Have you or anyone you know tried to detect this behavior in Apple hardware?


