Detailed investigation of M1 load and store bandwidths from L1 out to DRAM

By: Chester, November 9, 2021 8:26 pm
Room: Moderated Discussions
--- on November 9, 2021 1:39 pm wrote:
> I have no idea how many people here read my (ongoing, the public version is only
> version 0.7) exegesis of M1 internals. But those who have read the entire thing
> (all 300+ pages!) will remember an on-going bafflement regarding the L1 cache.

Why not use your previous name (maynard)? Sure we've had disagreements, but that's par for the course in forums. I read through some of it, and it's overall impressive analysis with good attention to detail (especially with non-sustained throughput). Good job!

> (a) we can perform three load pairs (one instruction, loads two successive 64b from an address)
> or three load vector (one instruction, loads 128b) and so achieve 48B per cycle.
> (b) but read what I said above. The LSU can sustain this, but the L1
> can not because it can deliver a maximum of 16B from each half.
> So game over? Maximum throughput is about 32B*3.2GHz=~102.4GB/s.
> Not so fast! If you actually measure things, in a variety of ways (my document lists an
> exhausting set, STREAM-type benchmarks give a different sort of set) you will find that
> you can actually sustain a higher bandwidth than this! Not much, but not negligible either,
> call it around 10%, though it varies depending on the exact access pattern.

Have you tried using a simple unrolled loop of ldr q, [address] linearly reading an array? STREAM results may vary depending on how well the compiler vectorizes it (from what I remember, that doesn't happen). Dougall says ldr q, [address] executes at 3/cycle. So I expect L1 throughput to be 153 GB/s, with anything lower caused by bank conflicts or noise in measurements.

3 loads per cycle is not out of the question, especially with 128-bit data paths, a low frequency design, and 5nm. Zen 3 does 3x64-bit loads per cycle with banking. And 153.6 GB/s of L1D load BW is far below the >250 GB/s that you can get out of a Zen 2/3 core running above 4 GHz (256-bit AVX loads).
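For reference, the peak figures in this thread are just loads per cycle times bytes per load times clock. A quick Python sketch of that arithmetic; the function name is mine, and the 3.2 GHz (M1) and 4.0 GHz (Zen 2) clocks are the round numbers assumed in the discussion, not measurements.

```python
def peak_bandwidth_gbs(loads_per_cycle, bytes_per_load, clock_ghz):
    """Theoretical peak L1D load bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return loads_per_cycle * bytes_per_load * clock_ghz

# M1 if the L1 can only deliver two 16B loads per cycle
m1_two_loads = peak_bandwidth_gbs(2, 16, 3.2)
# M1 if all three 16B loads per cycle are sustained
m1_three_loads = peak_bandwidth_gbs(3, 16, 3.2)
# Zen 2 with two 256-bit (32B) AVX loads per cycle at an assumed 4.0 GHz
zen2_avx = peak_bandwidth_gbs(2, 32, 4.0)

print(f"M1, 2 x 16B/cycle:   {m1_two_loads:.1f} GB/s")    # 102.4
print(f"M1, 3 x 16B/cycle:   {m1_three_loads:.1f} GB/s")  # 153.6
print(f"Zen 2, 2 x 32B/cycle: {zen2_avx:.1f} GB/s")       # 256.0
```

A measured result about 10% above 102.4 GB/s sits between the first two lines, which is why the question is whether the real ceiling is 32 or 48 bytes per cycle.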

> cases where two 16B loads can
> be serviced from the L1, along with a 16B load from the store queue, for a throughput of 48B.

You can test this by mixing in stores followed soon after by a load from the same address, in a 1:2 ratio with 'normal' loads.
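To make the mix concrete, here is a rough Python sketch that just prints the instruction pattern I have in mind: one store, one reload of the same address, and two ordinary loads per group. The register choices (x0 for the array, x1 for a scratch buffer), the offsets, and the function name are all my own placeholders, and you would still need to wrap the output in a timed loop.

```python
def forwarding_mix(groups, vec_bytes=16):
    """Return AArch64 assembly lines for `groups` repetitions of the mix.

    Assumed register use (illustrative only):
      x0 = base of the array read by the 'normal' loads
      x1 = base of a scratch buffer used for the store/reload pairs
    """
    lines = []
    for g in range(groups):
        scratch_off = g * vec_bytes       # one 16B slot per group
        normal_off = 2 * g * vec_bytes    # two 16B normal loads per group
        lines.append(f"str q0, [x1, #{scratch_off}]")  # store 16B
        lines.append(f"ldr q1, [x1, #{scratch_off}]")  # reload same address
        lines.append(f"ldr q2, [x0, #{normal_off}]")   # normal load
        lines.append(f"ldr q3, [x0, #{normal_off + vec_bytes}]")  # normal load
    return lines

print("\n".join(forwarding_mix(2)))
```

If loads per cycle in this mix come out above two while a pure-load loop over the same array does not, that would support the store queue acting as a third data source.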

> So the point now is: thoughts?
> The effect (up to 10% excess load bandwidth, over what you would expect, for streaming load patterns
> limited to regions smaller than L1) is real, the only question is where it comes from.
> The mechanism I posit is my best hypothesis so far.

My guess is M1 is actually capable of 153 GB/s L1D load bandwidth per core. Andrei's code isn't open source, but there's likely some inefficiency with misaligned loads, loop overhead, boundary checking overhead, and address generation overhead (I ran into all of these optimizing my own mem bw benchmark). His measurements also seem very far off for x86 cores. He only gets 120-130 GB/s of L1D BW on Zen 2, or just above 32 bytes/cycle assuming the 3900X was running at 4 GHz. I was able to get ~266 GB/s of load bandwidth out of a Zen 2 core (3950X) with aligned 256-bit AVX loads.
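The bytes-per-cycle conversions above are easy to check. A small Python sketch, with helper names of my own; the 4.0 GHz clock for the 3900X is the assumption stated above, and 64 bytes/cycle is Zen 2's two 256-bit loads per cycle.

```python
def bytes_per_cycle(bandwidth_gbs, clock_ghz):
    """Convert a measured bandwidth to bytes per core cycle."""
    return bandwidth_gbs / clock_ghz

def implied_clock_ghz(bandwidth_gbs, bytes_each_cycle):
    """Clock needed to reach a bandwidth at a fixed bytes/cycle rate."""
    return bandwidth_gbs / bytes_each_cycle

# Andrei's Zen 2 L1D range, assuming the 3900X ran at 4.0 GHz
for gbs in (120, 130):
    print(f"{gbs} GB/s -> {bytes_per_cycle(gbs, 4.0):.1f} B/cycle")  # 30.0, 32.5

# My 3950X result, if the core sustained the full 2 x 32B per cycle
print(f"266 GB/s -> {implied_clock_ghz(266, 64):.2f} GHz")  # 4.16
```

So 120-130 GB/s is about half of what the core can do per cycle, while 266 GB/s is consistent with full-rate loads at a boost clock a bit above 4.1 GHz.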

> What do people know about this in the case of eg, Intel? Obviously they have the same specialty pool of buffers
> holding data in transit to/from the cache, and the same issue of having to test loads (and stores) for hits
> against that data, though (TSO memory model, sigh, means they may not be able to exploit anything about this
> as aggressively as ARM). But are they aggressive in ensuring that these buffers do not interfere with the
> performance of the "main" cache? eg is it known that "installing prefetched lines" or "servicing cast-out
> lines" takes away from L1 bandwidth, or is this one of these things where the experts know "oh yeah, that
> used to be an issue up to Nehalem generation, but it was all fixed with Sandy Bridge"?
> Is it expected that if a load hits in the specialty pool (rather than the L1 proper) it will augment
> bandwidth? Or is this treated as a "correct but sucks" case, where a replay or even a flush occurs,
> while the data is transferred out of the specialty buffer into the cache proper?

I don't see anything strange about Intel - L1D load BW is pretty close to theoretical on all the chips I tested. I don't think you can increase bandwidth through store forwarding. From Sandy Bridge to Ice Lake, you have two L1D load ports and two AGUs for loads, so you can't generate addresses faster than you can load data for them anyway.

Maybe in theory you can combine loads that go to the same address, but I've never seen it happen. Even if it only sends one load to the L1D, you're still bound by AGU throughput to figure out the addresses in the first place.

> (a) (shout out to Travis) Apple seems to see no performance advantage in the handling of all-zero
> lines for either read or write for SLC and DRAM bandwidth. The patent about treating such
> lines specially talks about saving energy, but not performance, so it may well be that that's
> exactly it. For now anyway, even if the patent is implemented, what it gives you is some energy
> savings in reading or writing all zero lines; but no bandwidth boost.
> (b) The graphs Andrei F gives for M1 single bandwidth are (no surprise) correct. They show essentially
> ~100 GB/s in L1 (Andrei does not notice or comment on the slight 10% boost above expectations)
> - ~85 GB/s in the L2 region
> - ~65 GB/s in the SLC region
> - ~60 GB/s in the DRAM region
> What Andrei does not show is store bandwidth, and this is even more amazing.
> The 100GB/s in the L1 region (two stores per cycle, each 16B wide) persists all the way out to the L2 region.
> Obviously this tells us that there's a 32B path between L1 and L2 (which we could have calculated anyway
> from the load bandwidth) but it also tells us that there's basically zero overhead in the mechanism. I think
> what happens is once stores start missing in L1, they are aggregated into specialty buffers (4 cycles times
> 32B per cycle to fill a buffer, four cycles to transfer it 32B per cycle to L2). The stores never even pass
> through the L1 proper, I suspect, and the co-ordination of specialty buffers, bus to L2, and L2 transferring
> the buffers into its storage is perfectly orchestrated without a cycle of overhead.
> The read path is not as efficient (it has overhead of about one dead cycle per 4 to 5 live
> cycles) which I think has to do with collisions of the mechanism I described above of moving
> the current L1 hot line to the specialty pool while also needing to accept prefetched lines
> from L2 into the specialty pool (and then transferred into the L1 proper).

I think we need to see Andrei's code to know more. Unfortunately, it doesn't seem to be open source. If you have access to M1, could you give a run? "make aarch64" should give you an executable, and running it without args will show single-threaded bandwidth with aligned NEON loads in an unrolled loop.

> Once stores hit SLC they're still about 10% higher bandwidth than loads. My guess is that this
> has to do with the NoC; the L2 can gather together 2 (or more?) castout lines to be written
> back to SLC, and those lines can take a single hit of NoC header and frequency-domain-matching
> overhead amortized over 2 or more lines, whereas for loads, for obvious latency reasons, lines
> move from SLC to L2 one line at a time (though at some, perhaps, this could be improved if you
> know that the lines are being prefetched and so latency is not quite so pressing).
> My guess is that the NoC is, unlike that L1 to L2 connection, 64B wide, but runs slower,
> rather than L1L2 which is narrower but faster. Overall Intel and Apple have very similar
> bandwidths to L3/SLC, representing, I think, convergence on the same 64B width and similar
> frequencies at this level, regardless of the differences in choices in the inner core.
> Finally you thought the Apple read bandwidth (~90% of theoretical maximum, 68.5GB/s for LPDDR4x
> 4266 at 16B wide) was good? The write bandwidth is pretty much 100% of theoretical maximum!
> I think this pretty much confirms that Apple must be running the SLC as a virtual write queue (ie very tight integration
> between the SLC and memory controller means that the memory controller opportunistically writes out data as appropriate
> given what pages are currently busy [refresh, precharge, being read, ...], in other words the SLC can act like
> a write queue for the memory controller as large as it likes, rather than the write queue being limited to a few
> dozen entries, with scheduling limited to visibility of those few dozen addresses.
> The best I can find for something equivalent is Tiger Lake here:
> , which I
> believe is LPDDR4x at 4266 and 2 channels, ie 128b, wide). I am honestly at a loss as to why
> Intel's DRAM bandwidth is so astonishingly bad compared to Apple. Is this setup in fact configured
> only 1 channel wide (so the comparison should be against an absolute peak of 34GB/s ?) The other
> weird thing I noticed comparing this to Apple is what's up with the L2 region?
> Does Intel just not prefetch from L2 to L1? My point is -- if 128B loads (orange line) can hit in L1 at 140GB/s,
> and if the L2 can service data at 140GB/s (look at the dotted green line in the L2 region) why isn't data
> prefetched from the L2 to L1 at that 140GB/s? Obviously there's a lot I don't understand about Intel's constraints
> and tradeoffs, but there seems like a real bug, or at least very strange choice, here.

I've noticed this too - Intel's memory bandwidth is poor for a single core. I think it's because they can't handle enough L2 or L3 misses per core to absorb memory latency. But I don't think it's a bug. Rather it seems intentional because a single core simply doesn't need more bandwidth.
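The reasoning here is basically Little's law: sustained bandwidth is capped at (misses in flight) x (line size) / (latency). A Python sketch of that relation follows. The 12 outstanding misses and 80 ns latency are made-up illustrative values, not measurements of any Intel core, and the function names are mine.

```python
def sustained_bandwidth_gbs(outstanding_misses, latency_ns, line_bytes=64):
    """Upper bound on bandwidth from Little's law (bytes per ns equals GB/s)."""
    return outstanding_misses * line_bytes / latency_ns

def misses_needed(target_gbs, latency_ns, line_bytes=64):
    """Misses that must be in flight to sustain target_gbs at this latency."""
    return target_gbs * latency_ns / line_bytes

# Hypothetical core: 12 misses in flight, 80 ns memory latency
print(f"{sustained_bandwidth_gbs(12, 80):.1f} GB/s")        # 9.6
# In-flight 64B lines needed for 32 GB/s at the same latency
print(f"{misses_needed(32, 80):.0f} lines in flight")       # 40
```

The point is only the shape of the tradeoff: at a fixed latency, per-core bandwidth scales with how many misses the core (and its prefetchers) can keep outstanding.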

One way of checking whether bandwidth is a bottleneck is counting cycles where the fill buffer (L1D misses) or superqueue (L2 misses) fills up. On Haswell that's often just a percent or two of core cycles. I also checked how often Zen 2 allocates more than 12 MABs (miss address buffers, AMD's equivalent of fill buffers), and it's also 1-2% of core cycles. So Zen 2's higher single-core bandwidth to L3/memory is rarely used.

With Golden Cove, Intel has more memory bandwidth per core - about 32 GB/s with DDR4-2400 from my testing. They're probably getting a percent or less of performance gains from that, but against Zen 3 they're trying to get every last bit.
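To put the DRAM numbers in context, theoretical peak is transfer rate times bus width. A Python sketch; the LPDDR4x-4266 widths (16B, or 8B for the single-channel case raised in the quote) come from the quoted post, while treating my DDR4-2400 setup as dual channel (16B) is an assumption for illustration. The function name is mine.

```python
def dram_peak_gbs(mega_transfers_per_s, bus_bytes):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return mega_transfers_per_s * bus_bytes / 1000

lpddr4x_128b = dram_peak_gbs(4266, 16)   # 128-bit LPDDR4x-4266
lpddr4x_64b = dram_peak_gbs(4266, 8)     # same memory, half the width
ddr4_2400 = dram_peak_gbs(2400, 16)      # assumed dual-channel DDR4-2400

print(f"LPDDR4x-4266, 16B wide: {lpddr4x_128b:.1f} GB/s")  # 68.3
print(f"LPDDR4x-4266, 8B wide:  {lpddr4x_64b:.1f} GB/s")   # 34.1
print(f"DDR4-2400, 16B wide:    {ddr4_2400:.1f} GB/s")     # 38.4
print(f"32 GB/s is {32 / ddr4_2400:.0%} of that DDR4 peak")
```

Under that dual-channel assumption, 32 GB/s from one core would already be most of what the memory can deliver.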