Sectoring and suchlike in L1 caches

By: ---, November 29, 2021 9:03 pm
Room: Moderated Discussions
I've discovered a few interesting things about the M1 L1D. As usual the full report will eventually appear with all the supporting data, but for now I put this out there, both to clear up some common misapprehensions and to ask if people are aware of similar behavior in other systems.

So, the interesting points:
(a) The M1's L1D line length is NOT 128B, it is 64B. (Even though Apple reports it as 128B).
By line length I mean essentially the "minimal unit of addressability". The L1D can hold 2048 distinct units of addressability, not the 1024 you'd expect from a 128B line length.
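For anyone who wants to reproduce the capacity part, here's a minimal Python sketch of the pointer-chase method. The names (build_chain, chase) are mine, and a real probe would be a dependent-load chase over a 64B-strided buffer written in C or assembly; Python only shows the shape of the experiment, not usable latencies:

```python
import random
import time

def build_chain(n_units):
    """Build a random single-cycle pointer chase over n_units slots.

    Each slot stands in for one 64B "unit of addressability"; in a real
    probe each slot would be a 64B-aligned location in a large buffer.
    """
    order = list(range(n_units))
    random.shuffle(order)
    chain = [0] * n_units
    for i in range(n_units):
        # Each element points at the next one in the shuffled order,
        # giving one cycle that visits every slot exactly once.
        chain[order[i]] = order[(i + 1) % n_units]
    return chain

def chase(chain, steps):
    """Time a dependent pointer chase; returns ns per step."""
    idx = 0
    t0 = time.perf_counter_ns()
    for _ in range(steps):
        idx = chain[idx]
    return (time.perf_counter_ns() - t0) / steps

# Sweep working sets around the two candidate capacities: 1024 units
# (128B lines) vs 2048 units (64B lines). A latency step appearing
# only past 2048 units is what points at a 64B minimal unit.
for n in (512, 1024, 2048, 4096):
    print(n, chase(build_chain(n), 100_000))
```

The randomized order matters: a sequential chase would be trivially prefetched and hide the capacity edge you're looking for.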

(b) Apple says it is 128B (type sysctl hw.cachelinesize in Terminal), but I think they're reporting this as the best single number (if you insist on a single number) for a complex situation.

Obviously other cases in which line length matters are coherence and locking granularity, and I'll leave it to others (at least for now) to investigate those.
My guess is that from the L2 outward the granularity is 128B, within a cluster it is maybe 64B, with something complicated being impedance-matched by the L2 when required.
Alternatively, when interesting coherence or locking occurs, the partner line is essentially flushed and the two neighboring lines are fused into a single 128B-apparent line.

(c) Region prefetching appears to be universal in high-end CPUs, but does it go down to the L1? The papers I've seen all talk about doing it in the context of transfers into L2; Apple is region prefetching into L1. The region size appears to be 16kiB (or anything smaller). My tests did not probe whether two (or more) regions could be simultaneously active. Is anyone doing that? Is it worth testing for?
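A sketch of how the region test can be set up, in the same hypothetical Python style (region_probe_offsets is my name; the 16 KiB region size and 64B line size come from the observations above):

```python
import random

REGION = 16 * 1024   # candidate region size from the text
LINE   = 64          # minimal unit of addressability, per point (a)

def region_probe_offsets(trigger_addr, n_probes=8, seed=1):
    """Pick n_probes distinct line-aligned addresses inside the same
    16 KiB-aligned region as trigger_addr, excluding the trigger line.

    The experiment: load trigger_addr from a cold region, wait briefly,
    then time loads at these addresses. If they come back at L1 latency
    rather than L2/DRAM latency, the whole region was pulled in, i.e.
    region prefetching into L1.
    """
    base = trigger_addr & ~(REGION - 1)            # region start
    lines = [base + i * LINE for i in range(REGION // LINE)]
    trigger_line = trigger_addr & ~(LINE - 1)
    candidates = [a for a in lines if a != trigger_line]
    rng = random.Random(seed)                      # reproducible picks
    return rng.sample(candidates, n_probes)
```

The miss-latency baseline comes from timing the same probes against a distant region that was never touched.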

(d) Obviously cache line length is always a compromise. Cases of substantial spatial reuse benefit from longer cache lines (in the extreme case this shades into a region prefetcher). Cases of minimal spatial reuse (I load one next-node pointer then jump far away, and the rest of the line is never used, at least not before the line has been replaced) benefit from a sub-sectored cache, at least in terms of energy, and possibly also in fewer cycles during which the transfer bus between L2 and L1 is still busy by the time I want to transfer the next line.

Apple seems to be (by which I mean there is strong numerical evidence consistent with this pattern and with nothing else obvious) dynamically switching among these various options.

In the case of most minimal data reuse, the L2->L1 transfer unit is a half-line, i.e. 32B.
In the case where there is some reuse of data between adjacent lines, greedy transfer moves a double line of 128B or even 256B. My guess is under the conditions of the region prefetcher, the transfer unit may be as large as 512B (so 16 successive 32B transfers in response to one L1 request).

This is fairly remarkable stuff! Sectored caches have existed before (I think most of the PPCs were using sectoring, but I don't know if it was dynamic; I think it was a static scheme of always transferring a half line, but allowing two half lines to share a single address tag back in the days when the area of an address tag was something you cared about). And pulling in a companion cache line (or something similar like next-line prefetching) is basically prefetch stage zero, before you even bother with a stride prefetcher.
But the full range of options here, with dynamic transitioning from transferring half a cache line up to 16 cache lines, based on prevailing conditions, and into the L1 (as opposed to operating at the L2 level), goes far beyond anything else of which I'm aware.