Thanks for the data

By: Andrei F (andrei.delete@this.anandtech.com), November 10, 2021 1:52 pm
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on November 10, 2021 10:17 am wrote:
> Andrei F (andrei.delete@this.anandtech.com) on November 10, 2021 3:12 am wrote:
> > Chester (lamchester.delete@this.gmail.com) on November 9, 2021 7:26 pm wrote:
> > > --- (---.delete@this.redheron.com) on November 9, 2021 1:39 pm wrote:
> > > > I have no idea how many people here read my (ongoing, the public version is only
> > > > version 0.7) exegesis of M1 internals. But those who have read the entire thing
> > > > (all 300+ pages!) will remember an on-going bafflement regarding the L1 cache.
> > >
> > > Why not use your previous name (maynard)? Sure we've had disagreements, but that's par
> > > for the course in forums. I read through some of it, and it's overall an impressive analysis
> > > with good attention to detail (especially with non-sustained throughput). Good job!
> > >
> > > > (a) we can perform three load pairs (one instruction, loads two successive 64b from an address)
> > > > or three load vector (one instruction, loads 128b) and so achieve 48B per cycle.
> > > > (b) but read what I said above. The LSU can sustain this, but the L1
> > > > can not because it can deliver a maximum of 16B from each half.
> > > >
> > > > So game over? Maximum throughput is about 32B*3.2GHz=~102.4GB/s.
> > > > Not so fast! If you actually measure things, in a variety of ways (my document lists an
> > > > exhausting set, STREAM-type benchmarks give a different sort of set) you will find that
> > > > you can actually sustain a higher bandwidth than this! Not much, but not negligible either,
> > > > call it around 10%, though it varies depending on the exact access pattern.
> > >
> > > Have you tried using a simple unrolled loop of ldr q, [address] linearly reading an array? STREAM
> > > results may vary depending on how well the compiler vectorizes it (from what I remember, that
> > > doesn't happen). Dougall says ldr q, [address] executes at 3/cycle. So I expect L1 throughput
> > > to be 153 GB/s, with anything lower caused by bank conflicts or noise in measurements.
> > >
> > > 3 loads per cycle is not out of the question, especially with 128-bit data paths, a low frequency design,
> > > and 5nm. Zen 3 does 3x64-bit loads per cycle with banking. And 153.6 GB/s of L1D load BW is far below
> > > the >250 GB/s that you can get out of a Zen 2/3 core running above 4 GHz (256-bit AVX loads).
> > >
> > > > cases where two 16B loads can
> > > > be serviced from the L1, along with a 16B load from the store queue, for a throughput of 48B.
> > >
> > > You can test this by mixing in stores followed soon after by a load
> > > from the same address, in a 1:2 ratio with 'normal' loads.
> > >
> > > > So the point now is: thoughts?
> > > > The effect (up to 10% excess load bandwidth, over what you would expect, for streaming load patterns
> > > > limited to regions smaller than L1) is real, the only question is where it comes from.
> > > > The mechanism I posit is my best hypothesis so far.
> > >
> >
> > > Andrei's code isn't open source, but there's likely some inefficiency
> > > with misaligned loads, loop overhead, boundary checking
> > > overhead, and address generation overhead (I ran into all of these optimizing my own mem bw benchmark).
> > > His measurements also seem very far off for x86 cores. He only gets 120-130 GB/s of L1D BW on Zen
> > > 2, or just above 32 bytes/cycle assuming the 3900X was running at 4 GHz. I was able to get ~266 GB/s
> > > of load bandwidth out of a Zen 2 core (3950X) with aligned 256-bit AVX loads.
> >
> >
> > You know, you can just talk to me instead of theorizing. The test
> > works just fine; I only publish a fraction of the data.
> >
> > Various bandwidth access patterns for Zen 3 vs. M1 Max, for example; again,
> > there's some overhead at small sizes since this was meant to be an MT test.
> >
> >
> > https://docs.google.com/spreadsheets/d/1HpWuiA57yJP2VfTyqtiSCVAK3yBqO0LzbkdMngi26RQ/edit?usp=sharing
>
> Thanks! This looks much better for Zen 3. It's very close to the theoretical 2x256-bit per cycle
> load bandwidth. I was looking before at the Zen 2 data at https://images.anandtech.com/doci/14892/bw-3900.png and that just didn't look right.
>
> For M1, it seems to be simply 2x16B/cycle. The only result to exceed 32B/c is just 0.15% higher, and I don't
> think that's significant; it could very well be timer variation. And I assume the 'scalar' loads used LDP.
>
> Also, this wasn't completely theorizing. When I first tried to measure cache/memory bandwidth by
> linearly reading an array, I had very inconsistent and unexpectedly low results for Zen 2, but not
> Zen 3. Aligning loads fixed the problem for Zen 2. It seems like Zen 3 actually has 3x256-bit L1D
> load ports (but only a 2x256-bit path to the FPU) and was able to absorb the misaligned accesses.

I don't remember what happened with that graph; I looked up the various Zen 2 result sets and they're all >250 GB/s for 256-bit loads. You're right that the number there is wrong, but it's not the test that is awry; I may have screwed something up when copying things.

The accesses are aligned in all of these tests.
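
For anyone who wants to reproduce that kind of number, here's a minimal sketch of an aligned 256-bit AVX load kernel along the lines Chester describes. This is illustrative only, not my actual test code; it assumes x86-64 with AVX2 and a 32-byte-aligned, L1-resident buffer whose size is a multiple of 128 bytes, and it leaves out timing and allocation. Function and variable names are made up.

#include <immintrin.h>
#include <stddef.h>

/* Reads 'bytes' of data 'iters' times using aligned 256-bit loads.
   Four independent accumulators keep the XORs off the critical path,
   so the loop is limited by load throughput rather than ALU latency. */
__m256i avx_load_kernel(const void *buf, size_t bytes, size_t iters)
{
    const __m256i *p = (const __m256i *)buf;  /* assumed 32-byte aligned */
    size_t n = bytes / 32;                    /* assumed multiple of 4   */
    __m256i a0 = _mm256_setzero_si256(), a1 = a0, a2 = a0, a3 = a0;

    for (size_t i = 0; i < iters; i++) {
        for (size_t j = 0; j < n; j += 4) {
            /* _mm256_load_si256 requires alignment; a misaligned buffer would
               need _mm256_loadu_si256 and loads can then split across lines. */
            a0 = _mm256_xor_si256(a0, _mm256_load_si256(p + j));
            a1 = _mm256_xor_si256(a1, _mm256_load_si256(p + j + 1));
            a2 = _mm256_xor_si256(a2, _mm256_load_si256(p + j + 2));
            a3 = _mm256_xor_si256(a3, _mm256_load_si256(p + j + 3));
        }
    }
    /* Combine and return so the compiler can't discard the loads. */
    return _mm256_xor_si256(_mm256_xor_si256(a0, a1), _mm256_xor_si256(a2, a3));
}

Bandwidth is then bytes*iters over the measured time; two 32B loads per cycle at a bit above 4 GHz is where the >250 GB/s ballpark comes from.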
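
On the M1 side, the unrolled ldr q loop suggested above would look roughly like this; again a sketch with made-up names, assuming AArch64 with GCC/Clang inline asm and an L1-resident buffer whose size is a multiple of 64 bytes. In practice you'd unroll further so loop overhead stays out of the measurement.

#include <stdint.h>
#include <stddef.h>

/* Streams through 'bytes' of data 'iters' times with 128-bit LDR Q loads.
   At 3 loads per cycle and 3.2 GHz this tops out around 3 * 16B * 3.2G = ~153 GB/s. */
void ldrq_load_kernel(const uint8_t *buf, size_t bytes, size_t iters)
{
    for (size_t i = 0; i < iters; i++) {
        for (size_t off = 0; off < bytes; off += 64) {
            const uint8_t *p = buf + off;
            __asm__ volatile(
                "ldr q0, [%0]\n\t"
                "ldr q1, [%0, #16]\n\t"
                "ldr q2, [%0, #32]\n\t"
                "ldr q3, [%0, #48]\n\t"
                :
                : "r"(p)
                : "v0", "v1", "v2", "v3", "memory");
        }
    }
}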
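
And the 1:2 store/load mix test proposed above (one store followed soon after by a load of the same address, alongside two normal loads) could be sketched as below; purely illustrative, under the same AArch64 inline-asm assumptions, with the stored value being whatever happens to sit in q0 since only the addresses matter for a bandwidth test.

#include <stdint.h>
#include <stddef.h>

/* Per inner step: one 16B store to 'scratch', two normal 16B loads from 'buf',
   and one 16B reload of the just-stored address, which the core may service
   from the store queue instead of the L1 arrays. */
void ldst_mix_kernel(const uint8_t *buf, uint8_t *scratch, size_t bytes, size_t iters)
{
    for (size_t i = 0; i < iters; i++) {
        for (size_t off = 0; off < bytes; off += 32) {
            const uint8_t *p = buf + off;
            __asm__ volatile(
                "str q0, [%1]\n\t"       /* store (data value is irrelevant) */
                "ldr q1, [%0]\n\t"       /* normal load 1                    */
                "ldr q2, [%0, #16]\n\t"  /* normal load 2                    */
                "ldr q3, [%1]\n\t"       /* reload of the stored address     */
                :
                : "r"(p), "r"(scratch)
                : "v0", "v1", "v2", "v3", "memory");
        }
    }
}

If the store queue really can supply a third 16B load per cycle, this mix should land closer to 48B/cycle than the 32B/cycle the L1 arrays alone allow.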