Detailed investigation of M1 load and store bandwidths from L1 out to DRAM

By: --- (---.delete@this.redheron.com), November 9, 2021 10:31 pm
Room: Moderated Discussions
Ganon (anon.delete@this.gmail.com) on November 9, 2021 7:02 pm wrote:
> --- (---.delete@this.redheron.com) on November 9, 2021 1:39 pm wrote:
> > I have no idea how many people here read my (ongoing, the public version is only
> > version 0.7) exegesis of M1 internals. But those who have read the entire thing
> > (all 300+ pages!) will remember an on-going bafflement regarding the L1 cache.
> >
>
>
> Thoroughly enjoyed the read; looking forward to the next update. Regarding
> m1 pro/max; seems some things have changed at least according to
>
> https://www.anandtech.com/show/17024/apple-m1-max-performance-review/2
>
> where a single core has >100GB/s all the way from L1 to DRAM; even better
> than M1.


That graph seems to be measuring something different from Anandtech's first graph, the one for the original M1.
The M1 graph was, as far as I can tell, pure *load* performance (at least that's the case of mine it matches most closely), and that's also the most obvious reading of the Intel graphs I referenced.
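For concreteness, here is a minimal sketch of the sort of pure-load streaming kernel such a graph would correspond to. The function name, word size, and the checksum trick are my own illustrative assumptions, not Andrei's actual code; a real benchmark would time many passes over a buffer sized to the cache level of interest.

#include <stddef.h>
#include <stdint.h>

/* Pure-load kernel: only loads hit the memory system. A decent compiler
 * will vectorize this into wide loads. */
uint64_t load_only(const uint64_t *buf, size_t n_words)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n_words; i++)
        sum += buf[i];
    return sum;  /* returned so the loads can't be optimized away */
}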

The M1 Max graph refers, I think, to something like "read at the end of the line, write at the beginning, move 16 bytes in each direction, and repeat". I have not (yet) tried patterns like that.
What I can say to add to what I said earlier is:

- copy bandwidth in L1 and L2 (load from A, store to B, with A and B widely separated large regions) is about 50% higher than the basic load (or store) bandwidth; see the copy_kernel sketch after this list. The L1 amplification is, I think, due to even more aggressive use of the "specialty pool" mechanism I described, which allows fully half the writes to be aggregated in store buffers, bypassing the limitation of only two 128-bit paths into the L1 cache proper; the L2 amplification is because of separate (32B-wide) read and write paths to L2. L3 and DRAM bandwidth are untouched.

- consider something like A = A + B. Here we have two loads and a store, but the store is to essentially the same line as one of the loads. In this case an additional bandwidth-amplification mechanism kicks in: while each L1 cache half is connected to the LSU by only a 16B "bus",
(a) these buses have separate read lines and write lines, plus enough byte-enable signals to specify which bytes of the line are to be read or written, and
(b) Apple's magic L1, as I described (with patent reference), can load and store to the same line in the same cycle. So for this A = A + B case, the L1 bandwidth jumps to 190 GB/s! (See the add_in_place sketch after this list.)
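To make those two patterns concrete, here are minimal C sketches of the corresponding kernels. The names and the 8-byte word type are illustrative assumptions; bandwidth is then bytes moved divided by the time for many iterations over buffers sized to the cache level of interest.

#include <stddef.h>
#include <stdint.h>

/* Copy: load from A, store to B, with A and B widely separated regions.
 * Each iteration moves 8 bytes in and 8 bytes out. */
void copy_kernel(uint64_t *restrict b, const uint64_t *restrict a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        b[i] = a[i];
}

/* A = A + B: two loads and a store, where the store lands on the same
 * line that was just loaded -- the case where the same-line load/store
 * mechanism described above can kick in. */
void add_in_place(uint64_t *restrict a, const uint64_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}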

Andrei's code looks like it is able to use this second mechanism (writes to the same line as reads) to amplify the baseline 100 GB/s up to 150, but not all the way to 200. I can't say more until I can run my own experiments.

More significant is the SLC and DRAM region. DRAM is easy: the interface is simply twice as wide as the M1's. But the SLC region (and the fact that DRAM traffic can get to the L1 without being throttled) suggests that the NoC has also been bumped up! Possibilities include
- running it twice as fast. Not impossible, but probably not a first choice.
- having two parallel NoCs, essentially one carrying write traffic and one carrying reads. (It's hard to imagine exactly how that would work generically, though if the current scheme is a ring, you could now have two rings, clockwise and counterclockwise.)
- widening the NoC to 128B. This seems the most likely, but I'd want to see experiments with alternate stream patterns, like copy vs. pure read.

But yeah, obviously just stunning. I mean, what else can you say!


> Re: intel;
>
> not sure if any of these are related;
>
> https://arxiv.org/pdf/1907.00048.pdf talks about skylake server having non-overlapping accesses
> from the cache levels.
>
> Some of Aaron Spink's comments in
> https://www.realworldtech.com/forum/?threadid=169910&curpostid=170041
>
> might also be relevant regarding L2 reads effectively bypassing the L1.

Thanks for the refs! Reading for tonight!