Detailed investigation of M1 load and store bandwidths from L1 out to DRAM

By: Chester (lamchester.delete@this.gmail.com), November 10, 2021 3:12 am
Room: Moderated Discussions
--- (--.delete@this.redheron.com) on November 9, 2021 9:37 pm wrote:
> Chester (lamchester.delete@this.gmail.com) on November 9, 2021 7:26 pm wrote:
>
>
> > > (a) we can perform three load pairs (one instruction, loads two successive 64b from an address)
> > > or three load vector (one instruction, loads 128b) and so achieve 48B per cycle.
> > > (b) but read what I said above. The LSU can sustain this, but the L1
> > > can not because it can deliver a maximum of 16B from each half.
> > >
> > > So game over? Maximum throughput is about 32B*3.2GHz=~102.4GB/s.
> > > Not so fast! If you actually measure things, in a variety of ways (my document lists an
> > > exhausting set, STREAM-type benchmarks give a different sort of set) you will find that
> > > you can actually sustain a higher bandwidth than this! Not much, but not negligible either,
> > > call it around 10%, though it varies depending on the exact access pattern.
> >
> > Have you tried using a simple unrolled loop of ldr q, [address] linearly reading an array? STREAM
> > results may vary depending on how well the compiler vectorizes it (from what I remember, that
> > doesn't happen). Dougall says ldr q, [address] executes at 3/cycle. So I expect L1 throughput
> > to be 153 GB/s, with anything lower caused by bank conflicts or noise in measurements.
> >
> > 3 loads per cycle is not out of the question, especially with 128-bit data paths, a low frequency design,
> > and 5nm. Zen 3 does 3x64-bit loads per cycle with banking. And 153.6 GB/s of L1D load BW is far
> > below the >250 GB/s that you can get out of a Zen 2/3 core running above 4 GHz (256-bit AVX loads).
>
>
> Chester go back and read the section on the L1 architecture.
> The issue is not three loads per cycle, no-one doubts that and it's easily attested.
> The issue is the width and structure of the connections between the L1 and the LSU. It's irrelevant that
> the LSU can issue three 16B wide loads in a cycle if the L1 can only SERVICE two 16B loads in a cycle.

Reread my post, please. I'm suggesting that the L1 can deliver 3x16B loads per cycle, and that it's completely reasonable to see that on a low-clocked 5nm design. I suggested trying some different tests to see if that's the case. When you measure an average of more than 2 loads per cycle (from the L1D to the LSU), that strongly suggests there are at least 3 L1D load ports, and that your test code is bottlenecked by something other than L1D load ports.
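
To spell out the arithmetic behind that inference, here is a minimal Python sketch. The 3.2 GHz clock and 16B load width come from the figures quoted above; the function names are made up for illustration, and the "10% above 102.4 GB/s" input is just the rough figure from the quoted post, not a measurement of mine.

```python
import math

CLOCK_GHZ = 3.2       # clock used in the quoted 32B*3.2GHz estimate
BYTES_PER_LOAD = 16   # one 128-bit load, e.g. ldr q


def peak_gb_per_s(loads_per_cycle, bytes_per_load=BYTES_PER_LOAD, clock_ghz=CLOCK_GHZ):
    """Peak L1D load bandwidth in GB/s if this many loads complete every cycle."""
    return loads_per_cycle * bytes_per_load * clock_ghz


def avg_loads_per_cycle(measured_gb_per_s, bytes_per_load=BYTES_PER_LOAD, clock_ghz=CLOCK_GHZ):
    """Average number of loads per cycle implied by a measured bandwidth."""
    return measured_gb_per_s / (bytes_per_load * clock_ghz)


def min_load_ports(measured_gb_per_s):
    """Smallest whole number of load ports that could sustain the measured bandwidth."""
    # The small epsilon keeps an exact 2.0 from rounding up to 3 due to float noise.
    return math.ceil(avg_loads_per_cycle(measured_gb_per_s) - 1e-9)


if __name__ == "__main__":
    print(f"2 loads/cycle: {peak_gb_per_s(2):.1f} GB/s")
    print(f"3 loads/cycle: {peak_gb_per_s(3):.1f} GB/s")

    # Hypothetical input: roughly 10% above the 2-load ceiling, as described in the quote.
    measured = 102.4 * 1.10
    print(f"implied average: {avg_loads_per_cycle(measured):.2f} loads/cycle")
    print(f"minimum load ports: {min_load_ports(measured)}")
```

Any sustained average above 2.0 loads per cycle can't come from a 2-port L1D, so the port count has to round up to 3. The gap between that average and 3.0 then has to be explained by something else in the test.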

>
> The very fact that you are discussing this in language (bank conflicts) relevant to the traditional
> world, not to Apple's cache, tells me you completely miss the point I am trying to resolve.
>

Well, I guess nothing has changed. From your M1 doc, I got the impression you were approaching things from a critical point of view and using careful experimentation to validate theories. I admit I'm a bit frustrated that my impression was wrong.

Sigh, one day I'll get access to an M1 machine and do my own testing.