Thanks for the data

By: Chester, November 10, 2021 11:17 am
Room: Moderated Discussions
Andrei F on November 10, 2021 3:12 am wrote:
> Chester on November 9, 2021 7:26 pm wrote:
> > --- on November 9, 2021 1:39 pm wrote:
> > > I have no idea how many people here read my (ongoing, the public version is only
> > > version 0.7) exegesis of M1 internals. But those who have read the entire thing
> > > (all 300+ pages!) will remember an on-going bafflement regarding the L1 cache.
> >
> > Why not use your previous name (maynard)? Sure we've had disagreements, but that's par
> > for the course in forums. I read through some of it, and it's overall impressive analysis
> > with good attention to detail (especially with non-sustained throughput). Good job!
> >
> > > (a) we can perform three load pairs (one instruction, loads two successive 64b from an address)
> > > or three load vector (one instruction, loads 128b) and so achieve 48B per cycle.
> > > (b) but read what I said above. The LSU can sustain this, but the L1
> > > can not because it can deliver a maximum of 16B from each half.
> > >
> > > So game over? Maximum throughput is about 32B*3.2GHz=~102.4GB/s.
> > > Not so fast! If you actually measure things, in a variety of ways (my document lists an
> > > exhausting set, STREAM-type benchmarks give a different sort of set) you will find that
> > > you can actually sustain a higher bandwidth than this! Not much, but not negligible either,
> > > call it around 10%, though it varies depending on the exact access pattern.
> >
> > Have you tried using a simple unrolled loop of ldr q, [address] linearly reading an array? STREAM
> > results may vary depending on how well the compiler vectorizes it (from what I remember, that
> > doesn't happen). Dougall says ldr q, [address] executes at 3/cycle. So I expect L1 throughput
> > to be 153 GB/s, with anything lower caused by bank conflicts or noise in measurements.
> >
> > 3 loads per cycle is not out of the question, especially with 128-bit data paths, a low frequency design,
> > and 5nm. Zen 3 does 3x64-bit loads per cycle with banking. And 153.6 GB/s of L1D load BW is far below
> > >250 GB/s that you can get out of a Zen 2/3 core running above 4 GHz (256-bit AVX loads).
> >
> > > cases where two 16B loads can
> > > be serviced from the L1, along with a 16B load from the store queue, for a throughput of 48B.
> >
> > You can test this by mixing in stores followed soon after by a load
> > from the same address, in a 1:2 ratio with 'normal' loads.
> >
> > > So the point now is: thoughts?
> > > The effect (up to 10% excess load bandwidth, over what you would expect, for streaming load patterns
> > > limited to regions smaller than L1) is real, the only question is where it comes from.
> > > The mechanism I posit is my best hypothesis so far.
> >
> > Andrei's code isn't open source, but there's likely some inefficiency
> > with misaligned loads, loop overhead, boundary checking
> > overhead, and address generation overhead (I ran into all of these optimizing my own mem bw benchmark).
> > His measurements also seem very far off for x86 cores. He only gets 120-130 GB/s of L1D BW on Zen
> > 2, or just above 32 bytes/cycle assuming the 3900X was running at 4 GHz. I was able to get ~266 GB/s
> > of load bandwidth out of a Zen 2 core (3950X) with aligned 256-bit AVX loads.
> You know, you can just talk to me instead of theorizing. The test
> works just fine, I only publish a fraction of the data.
> Various bandwidth accesses of Zen3 vs M1 Max for example; again
> some overhead at small sizes as this was meant to be a MT test.

Thanks! This looks much better for Zen 3. It's very close to the theoretical 2x256-bit per cycle load bandwidth. I was previously looking at the 3900X (Zen 2) data from the AT bandwidth measurements, and that just didn't look right.

For M1, it seems to be simply 2x16B/cycle. The only result to exceed 32B/cycle is just 0.15% higher, which I don't think is significant; it could very well be timer variation. And I assume the 'scalar' loads used LDP.

Also, this wasn't completely theorizing. When I first tried to measure cache/memory bandwidth by linearly reading an array, I got very inconsistent and unexpectedly low results for Zen 2, but not Zen 3. Aligning the loads fixed the problem for Zen 2. It seems Zen 3 actually has 3x256-bit L1D load ports (but only a 2x256-bit path to the FPU) and was able to absorb the misaligned accesses.