Detailed investigation of M1 load and store bandwidths from L1 out to DRAM

By: Andrei F (andrei.delete@this.anandtech.com), November 10, 2021 4:12 am
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on November 9, 2021 7:26 pm wrote:
> --- (---.delete@this.redheron.com) on November 9, 2021 1:39 pm wrote:
> > I have no idea how many people here read my (ongoing, the public version is only
> > version 0.7) exegesis of M1 internals. But those who have read the entire thing
> > (all 300+ pages!) will remember an on-going bafflement regarding the L1 cache.
>
> Why not use your previous name (maynard)? Sure we've had disagreements, but that's par
> for the course in forums. I read through some of it, and it's overall impressive analysis
> with good attention to detail (especially with non-sustained throughput). Good job!
>
> > (a) we can perform three load pairs (one instruction, loads two successive 64b from an address)
> > or three load vector (one instruction, loads 128b) and so achieve 48B per cycle.
> > (b) but read what I said above. The LSU can sustain this, but the L1
> > cannot, because it can deliver a maximum of 16B from each half.
> >
> > So game over? Maximum throughput is about 32B × 3.2 GHz ≈ 102.4 GB/s.
> > Not so fast! If you actually measure things, in a variety of ways (my document lists an
> > exhausting set; STREAM-type benchmarks give a different sort of set), you will find that
> > you can actually sustain a higher bandwidth than this! Not much, but not negligible either:
> > call it around 10%, though it varies depending on the exact access pattern.
>
> Have you tried using a simple unrolled loop of ldr q, [address] linearly reading an array? STREAM
> results may vary depending on how well the compiler vectorizes it (from what I remember, that
> doesn't happen). Dougall says ldr q, [address] executes at 3/cycle. So I expect L1 throughput
> to be 153.6 GB/s, with anything lower caused by bank conflicts or noise in measurements.
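
For reference, here is a minimal sketch of the kind of unrolled 128-bit load loop described above. It is illustrative only (the function and names are invented, not anyone's published benchmark); built with -O3, the constant-count inner loop should unroll into straight-line ldr q loads, and the eight independent accumulators keep the EOR dependency chains off the critical path:

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* XOR-accumulating the loads stops the compiler from dead-code
       eliminating them; eight independent accumulators keep the EOR
       latency chains from becoming the bottleneck. */
    uint8x16_t read_loop(const uint8_t *buf, size_t len, long iters)
    {
        uint8x16_t a[8];
        for (int k = 0; k < 8; k++) a[k] = vdupq_n_u8(0);
        for (long r = 0; r < iters; r++)
            for (size_t i = 0; i < len; i += 128)   /* 8 x 16B per pass */
                for (int k = 0; k < 8; k++)         /* fixed trip count; unrolls */
                    a[k] = veorq_u8(a[k], vld1q_u8(buf + i + 16 * k));
        for (int k = 1; k < 8; k++) a[0] = veorq_u8(a[0], a[k]);
        return a[0];
    }

Timing len × iters bytes through this, with len well under the 128KB L1D, should show whether sustained throughput lands nearer 2 or 3 sixteen-byte loads per cycle.
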
>
> 3 loads per cycle is not out of the question, especially with 128-bit data paths, a low frequency design,
> and 5nm. Zen 3 does 3×64-bit loads per cycle with banking. And 153.6 GB/s of L1D load BW is far below
> the >250 GB/s that you can get out of a Zen 2/3 core running above 4 GHz (256-bit AVX loads).
>
> > cases where two 16B loads can
> > be serviced from the L1, along with a 16B load from the store queue, for a throughput of 48B.
>
> You can test this by mixing in stores followed soon after by a load
> from the same address, in a 1:2 ratio with 'normal' loads.
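
A sketch of that mix, under the same caveats (illustrative only; scratch is an invented name for a separate, non-aliasing 16B slot): one store reloaded shortly afterwards per two streaming loads, a 1:2 ratio. If the reload is forwarded from the store queue rather than serviced by the L1, the L1 read ports stay free for the streaming loads:

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* One store + reload-of-same-address per two streaming loads.
       A real measurement would unroll further and use more
       accumulators; this only demonstrates the access pattern. */
    uint8x16_t mixed_loop(const uint8_t *buf, uint8_t *scratch,
                          size_t len, long iters)
    {
        uint8x16_t a0 = vdupq_n_u8(0), a1 = a0, a2 = a0;
        const uint8x16_t v = vdupq_n_u8(0x5a);
        for (long r = 0; r < iters; r++) {
            for (size_t i = 0; i < len; i += 32) {
                vst1q_u8(scratch, v);                       /* store...          */
                a0 = veorq_u8(a0, vld1q_u8(buf + i));       /* 'normal' load     */
                a1 = veorq_u8(a1, vld1q_u8(buf + i + 16));  /* 'normal' load     */
                a2 = veorq_u8(a2, vld1q_u8(scratch));       /* ...reloaded soon  */
            }
        }
        return veorq_u8(veorq_u8(a0, a1), a2);
    }
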
>
> > So the point now is: thoughts?
> > The effect (up to 10% more load bandwidth than you would expect for streaming load patterns
> > limited to regions smaller than the L1) is real; the only question is where it comes from.
> > The mechanism I posit is my best hypothesis so far.
>

> Andrei's code isn't open source, but there's likely some inefficiency with misaligned loads, loop overhead, boundary checking
> overhead, and address generation overhead (I ran into all of these optimizing my own mem bw benchmark).
> His measurements also seem very far off for x86 cores. He only gets 120-130 GB/s of L1D BW on Zen
> 2, or just above 32 bytes/cycle assuming the 3900X was running at 4 GHz. I was able to get ~266 GB/s
> of load bandwidth out of a Zen 2 core (3950X) with aligned 256-bit AVX loads.
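
For comparison on the x86 side, a rough sketch of an aligned 256-bit load loop of the kind described above (again illustrative, not Chester's actual code; compile with -mavx2, keep buf 32-byte aligned and sized to fit in L1D). Two 32B loads per cycle at a little over 4 GHz lines up with the ~266 GB/s figure:

    #include <immintrin.h>
    #include <stddef.h>

    /* Four independent accumulators; VPXOR is single-cycle, so the
       chains never limit the two loads per cycle Zen 2 can sustain. */
    __m256i avx_read_loop(const __m256i *buf, size_t n_vecs, long iters)
    {
        __m256i a0 = _mm256_setzero_si256(), a1 = a0, a2 = a0, a3 = a0;
        for (long r = 0; r < iters; r++) {
            for (size_t i = 0; i + 4 <= n_vecs; i += 4) {
                a0 = _mm256_xor_si256(a0, _mm256_load_si256(buf + i));
                a1 = _mm256_xor_si256(a1, _mm256_load_si256(buf + i + 1));
                a2 = _mm256_xor_si256(a2, _mm256_load_si256(buf + i + 2));
                a3 = _mm256_xor_si256(a3, _mm256_load_si256(buf + i + 3));
            }
        }
        a0 = _mm256_xor_si256(a0, a1);
        a2 = _mm256_xor_si256(a2, a3);
        return _mm256_xor_si256(a0, a2);
    }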


You know, you can just talk to me instead of theorizing. The test works just fine; I only publish a fraction of the data.

For example, here are various bandwidth access patterns for Zen 3 vs. M1 Max; again, there is some overhead at small sizes, as this was meant to be an MT test.


https://docs.google.com/spreadsheets/d/1HpWuiA57yJP2VfTyqtiSCVAK3yBqO0LzbkdMngi26RQ/edit?usp=sharing