Detailed investigation of M1 load and store bandwidths from L1 out to DRAM

By: ---, November 9, 2021 1:39 pm
Room: Moderated Discussions
I have no idea how many people here read my (ongoing, the public version is only version 0.7) exegesis of M1 internals. But those who have read the entire thing (all 300+ pages!) will remember an on-going bafflement regarding the L1 cache.

The parts that matter for us are:
- the M1 has 3 load units
- the cache itself is split into two parts, call these the "even part" (even numbered lines of 128B) and the "odd part". Each such part is connected to the LSU via what we can think of as
+ a 128b read data bus
+ a 128b write data bus
+ some control lines
+ some byte enables

The end result of all this is that if, for example, I have three int16 loads at three different locations within the same line, then all three can be serviced by the cache in the same cycle.
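As a minimal sketch of the even/odd split described above: the mapping from address to cache half is assumed here to be simply the parity of the 128B line number, and the addresses and function names are made up for illustration.

```python
LINE_BYTES = 128  # line size stated in the post


def line_index(addr: int) -> int:
    """Number of the 128B line that contains this byte address."""
    return addr // LINE_BYTES


def cache_half(addr: int) -> str:
    """Assumed mapping: even-numbered lines live in the 'even' half, odd in the 'odd' half."""
    return "even" if line_index(addr) % 2 == 0 else "odd"


def same_line(addrs) -> bool:
    """True when every address falls inside one and the same line."""
    return len({line_index(a) for a in addrs}) == 1


# Three int16 loads at different offsets of one line (hypothetical addresses).
loads = [0x1000, 0x1010, 0x107E]
print([cache_half(a) for a in loads], same_line(loads))
```

All three addresses land in the same even line, which is the case the post says can be serviced in a single cycle.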

OK, so given this model, suppose we are attempting maximum load bandwidth. The naive expectation is that
(a) we can perform three load pairs (one instruction, loads two successive 64b values from an address) or three load vectors (one instruction, loads 128b) and so achieve 48B per cycle.
(b) but read what I said above. The LSU can sustain this, but the L1 cannot, because it can deliver a maximum of 16B from each half.

So game over? Maximum throughput is about 32B*3.2GHz=~102.4GB/s.
Not so fast! If you actually measure things, in a variety of ways (my document lists an exhausting set, STREAM-type benchmarks give a different sort of set) you will find that you can actually sustain a higher bandwidth than this! Not much, but not negligible either, call it around 10%, though it varies depending on the exact access pattern.
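The arithmetic behind these numbers, as a small sketch. The widths and the 3.2GHz clock are the ones quoted above; the 10% figure is the rough measured excess, not a precise value, and the variable names are my own.

```python
FREQ_GHZ = 3.2        # core clock used in the post
LOAD_UNITS = 3        # load units in the LSU
BYTES_PER_LOAD = 16   # one 128b load pair or vector load
CACHE_HALVES = 2      # even half and odd half
BYTES_PER_HALF = 16   # 128b read bus per half


def gb_per_s(bytes_per_cycle: float, freq_ghz: float = FREQ_GHZ) -> float:
    """Bytes per cycle times GHz gives GB/s directly."""
    return bytes_per_cycle * freq_ghz


lsu_peak = gb_per_s(LOAD_UNITS * BYTES_PER_LOAD)    # what the LSU could issue
l1_peak = gb_per_s(CACHE_HALVES * BYTES_PER_HALF)   # what the two halves can deliver
observed = l1_peak * 1.10                           # "call it around 10%" above that

print(f"LSU limit: {lsu_peak:.1f} GB/s")
print(f"L1 limit:  {l1_peak:.1f} GB/s")
print(f"Observed:  about {observed:.1f} GB/s")
```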

So what is going on here? This has worried me for months, but I think I have an answer.
Consider first the following question, which hints at the answer though is not relevant to the precise problem: how can a load be serviced if the data doesn't come from the L1 cache?
The answer is, of course, that the data could come from the store queue. And I expect that there are (rare, but non-zero; I haven't yet written a benchmark specifically testing this) cases where two 16B loads can be serviced from the L1, along with a 16B load from the store queue, for a throughput of 48B.

But think about this more generally. The L1 holds the cache proper, but also has a small pool of specialty buffers (I'm guessing about 8 or so?) used for things like
- cache castouts
- store buffers holding stores that missed in L1. (If enough of these fill a line fast enough, I think the line can be written straight to L2, bypassing L1 completely and avoiding the load of the line into L1 [only to be overwritten], but I am not sure.)
- lines delivered from L2 that need to be installed in the L1
- prefetch lines that may be held temporarily until they receive a load hit -- so that "uncertain" prefetch lines do not displace good lines in the cache

Significant points are that
- this small pool of lines has to be checked on every load, just like the L1 and just like the store queue, to be sure that a hit does not occur against one of these lines
- there must be some sort of path from this pool of lines to the LSU both to accept an address to test against all the lines, and to return a value if there is a match
- this pool needs its own path to/from the L1 cache proper. (At least, if it did not have its own path, we would see lower bandwidths than we do, because the L1 would have to stop servicing LSU requests every time it was busy moving one of these specialty lines into or out of the L1 proper.)

Now one can imagine various ways to set this up to save area, eg routing requests through the L1 proper on to the specialty buffer pool and back, but we all know that Apple is about using excess wire and area if the result will increase performance or lower energy!

OK, so our starting point is that any modern L1 has to have much the sort of setup I've described, regardless.
Suppose some bright spark in Apple looks at all this and notes the following points:
(a) Like the "load hits in store queue" case, this specialty pool doesn't just enforce correctness, it also amplifies bandwidth slightly in the cases where one load can hit in the specialty pool (and transfer data via the separate bus) along with two loads hitting two other lines in the main L1 cache. But that is probably an unusual case.
(b) HOWEVER suppose we do the following:
- if we note that an L1 line is being used repeatedly or, equivalently, that we have a streaming access pattern (basically streaming access with a stride of 0 or 1), then we can
+ use the internal transfer mechanism to move the line that is being pounded (or will be pounded next) to the specialty pool
+ access that line via the specialty pool simultaneously with some other line(s) in the streaming pattern, to amplify bandwidth!
It's not a strong amplification, but hell, 10% is 10%. I'm assuming that various checks and balances mean that the mechanism can't kick in on essentially every cycle (ie 50% amplification), rather the best we can do is get an additional third load (rather than just two loads) every fourth cycle or so. It may be that this is the very first chip with the mechanism, so it's somewhat experimental and non-aggressive, or that a few years of experience and simulation have shown that this level of aggression gives the best performance/energy tradeoff.
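To put numbers on "every fourth cycle or so", here is a small sketch of the average under the hypothesis above. The model is an assumption: two 16B loads on every cycle, plus one extra 16B load from the specialty pool once per period. The function names are invented for the example.

```python
BASE_LOADS = 2        # loads the two L1 halves can serve each cycle
BYTES_PER_LOAD = 16   # 128b per load


def avg_bytes_per_cycle(period: int) -> float:
    """Average bytes per cycle if one bonus load arrives once every `period` cycles."""
    total_loads = BASE_LOADS * period + 1
    return total_loads * BYTES_PER_LOAD / period


def uplift(period: int) -> float:
    """Fractional gain over the plain 32B per cycle L1 limit."""
    return avg_bytes_per_cycle(period) / (BASE_LOADS * BYTES_PER_LOAD) - 1.0


for period in (1, 4, 5):
    print(f"bonus load every {period} cycle(s): "
          f"{avg_bytes_per_cycle(period):.1f} B/cycle, uplift {uplift(period):.1%}")
```

In this simple model a bonus load every cycle gives the 50% case, every fourth cycle gives 12.5%, and every fifth cycle gives 10%, so the measured excess is consistent with the mechanism firing roughly once per four to five cycles.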

So the point now is: thoughts?
The effect (up to 10% excess load bandwidth, over what you would expect, for streaming load patterns limited to regions smaller than L1) is real, the only question is where it comes from.
The mechanism I posit is my best hypothesis so far.

What do people know about this in the case of eg, Intel? Obviously they have the same specialty pool of buffers holding data in transit to/from the cache, and the same issue of having to test loads (and stores) for hits against that data, though the TSO memory model (sigh) means they may not be able to exploit any of this as aggressively as ARM. But are they aggressive in ensuring that these buffers do not interfere with the performance of the "main" cache? eg is it known that "installing prefetched lines" or "servicing cast-out lines" takes away from L1 bandwidth, or is this one of these things where the experts know "oh yeah, that used to be an issue up to Nehalem generation, but it was all fixed with Sandy Bridge"?
Is it expected that if a load hits in the specialty pool (rather than the L1 proper) it will augment bandwidth? Or is this treated as a "correct but sucks" case, where a replay or even a flush occurs, while the data is transferred out of the specialty buffer into the cache proper?

(a) (shout out to Travis) Apple seems to see no performance advantage in the handling of all-zero lines for either read or write for SLC and DRAM bandwidth. The patent about treating such lines specially talks about saving energy, but not performance, so it may well be that that's exactly it. For now anyway, even if the patent is implemented, what it gives you is some energy savings in reading or writing all zero lines; but no bandwidth boost.

(b) The graphs Andrei F gives for M1 single-threaded bandwidth are (no surprise) correct. They show essentially
- ~100 GB/s in L1 (Andrei does not notice or comment on the slight 10% boost above expectations)
- ~85 GB/s in the L2 region
- ~65 GB/s in the SLC region
- ~60 GB/s in the DRAM region

What Andrei does not show is store bandwidth, and this is even more amazing.
The 100GB/s in the L1 region (two stores per cycle, each 16B wide) persists all the way out to the L2 region. Obviously this tells us that there's a 32B path between L1 and L2 (which we could have calculated anyway from the load bandwidth) but it also tells us that there's basically zero overhead in the mechanism. I think what happens is once stores start missing in L1, they are aggregated into specialty buffers (4 cycles times 32B per cycle to fill a buffer, four cycles to transfer it at 32B per cycle to L2). The stores never even pass through the L1 proper, I suspect, and the co-ordination of specialty buffers, bus to L2, and L2 transferring the buffers into its storage is perfectly orchestrated without a cycle of overhead.

The read path is not as efficient (it has overhead of about one dead cycle per 4 to 5 live cycles) which I think has to do with collisions of the mechanism I described above of moving the current L1 hot line to the specialty pool while also needing to accept prefetched lines from L2 into the specialty pool (and then transferred into the L1 proper).
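A quick check of how the store and load figures in the L2 region fit together, as a sketch. It assumes the 32B per cycle L1-L2 path and 3.2GHz clock used above and models the read overhead as idle cycles on that path; the names are my own.

```python
FREQ_GHZ = 3.2
PATH_BYTES_PER_CYCLE = 32   # inferred width of the L1-L2 path
LINE_BYTES = 128

# Cycles to fill (or drain) one line-sized buffer at full path width.
cycles_per_line = LINE_BYTES // PATH_BYTES_PER_CYCLE


def effective_gb_per_s(live_cycles: int, dead_cycles: int) -> float:
    """Bandwidth when only `live_cycles` out of every live+dead cycles move data."""
    peak = PATH_BYTES_PER_CYCLE * FREQ_GHZ
    return peak * live_cycles / (live_cycles + dead_cycles)


store_bw = effective_gb_per_s(4, 0)    # stores: no dead cycles
load_bw_low = effective_gb_per_s(4, 1)   # loads: one dead cycle per 4 live
load_bw_high = effective_gb_per_s(5, 1)  # loads: one dead cycle per 5 live

print(f"cycles per 128B line: {cycles_per_line}")
print(f"stores to L2: {store_bw:.1f} GB/s")
print(f"loads from L2: {load_bw_low:.1f} to {load_bw_high:.1f} GB/s")
```

The load range comes out at roughly 82 to 85 GB/s, which lines up with the ~85 GB/s seen in the L2 region.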

Once stores hit SLC they're still about 10% higher bandwidth than loads. My guess is that this has to do with the NoC; the L2 can gather together 2 (or more?) castout lines to be written back to SLC, and those lines can take a single hit of NoC header and frequency-domain-matching overhead amortized over 2 or more lines, whereas for loads, for obvious latency reasons, lines move from SLC to L2 one line at a time (though at some point, perhaps, this could be improved if you know that the lines are being prefetched and so latency is not quite so pressing).
My guess is that the NoC is, unlike the L1 to L2 connection, 64B wide but runs slower, whereas the L1-L2 path is narrower but faster. Overall Intel and Apple have very similar bandwidths to L3/SLC, representing, I think, convergence on the same 64B width and similar frequencies at this level, regardless of the differences in choices in the inner core.

Finally, you thought the Apple read bandwidth (~90% of theoretical maximum, ~68.3GB/s for LPDDR4x 4266 at 16B wide) was good? The write bandwidth is pretty much 100% of theoretical maximum!
I think this pretty much confirms that Apple must be running the SLC as a virtual write queue (ie very tight integration between the SLC and memory controller means that the memory controller opportunistically writes out data as appropriate given what pages are currently busy [refresh, precharge, being read, ...]). In other words the SLC can act like a write queue for the memory controller as large as it likes, rather than the write queue being limited to a few dozen entries, with scheduling limited to visibility of those few dozen addresses.
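For reference, the theoretical DRAM peak works out as follows. This sketch treats "4266" as a nominal 4266 MT/s and uses the 16B (128b) bus width given above; the ~60 GB/s read figure is the one quoted earlier for the DRAM region.

```python
TRANSFERS_PER_S = 4266e6   # LPDDR4x-4266, nominal transfer rate
BUS_BYTES = 16             # 128b wide interface

peak_gb_per_s = TRANSFERS_PER_S * BUS_BYTES / 1e9
read_gb_per_s = 60.0       # approximate measured load bandwidth in the DRAM region
read_fraction = read_gb_per_s / peak_gb_per_s

print(f"theoretical peak: {peak_gb_per_s:.1f} GB/s")
print(f"reads reach about {read_fraction:.0%} of peak")
```

Writes at "pretty much 100%" would then be close to the full theoretical figure.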

The best I can find for something equivalent is Tiger Lake here: (which I believe is LPDDR4x at 4266 and 2 channels wide, ie 128b). I am honestly at a loss as to why Intel's DRAM bandwidth is so astonishingly bad compared to Apple. Is this setup in fact configured only 1 channel wide (so the comparison should be against an absolute peak of 34GB/s)? The other weird thing I noticed comparing this to Apple is what's up with the L2 region?
Does Intel just not prefetch from L2 to L1? My point is -- if 128B loads (orange line) can hit in L1 at 140GB/s, and if the L2 can service data at 140GB/s (look at the dotted green line in the L2 region) why isn't data prefetched from the L2 to L1 at that 140GB/s? Obviously there's a lot I don't understand about Intel's constraints and tradeoffs, but this seems like a real bug, or at least a very strange choice.