Detailed investigation of M1 load and store bandwidths from L1 out to DRAM

By: --- (--.delete@this.redheron.com), November 9, 2021 10:37 pm
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on November 9, 2021 7:26 pm wrote:


> > (a) we can perform three load pairs (one instruction, loads two successive 64b from an address)
> > or three load vector (one instruction, loads 128b) and so achieve 48B per cycle.
> > (b) but read what I said above. The LSU can sustain this, but the L1
> > can not because it can deliver a maximum of 16B from each half.
> >
> > So game over? Maximum throughput is about 32B*3.2GHz=~102.4GB/s.
> > Not so fast! If you actually measure things, in a variety of ways (my document lists an
> > exhausting set, STREAM-type benchmarks give a different sort of set) you will find that
> > you can actually sustain a higher bandwidth than this! Not much, but not negligible either,
> > call it around 10%, though it varies depending on the exact access pattern.
>
> Have you tried using a simple unrolled loop of ldr q, [address] linearly reading an array? STREAM
> results may vary depending on how well the compiler vectorizes it (from what I remember, that
> doesn't happen). Dougall says ldr q, [address] executes at 3/cycle. So I expect L1 throughput
> to be 153 GB/s, with anything lower caused by bank conflicts or noise in measurements.
>
> 3 loads per cycle is not out of the question, especially with 128-bit data paths, a low frequency design,
> and 5nm. Zen 3 does 3x64-bit loads per cycle with banking. And 153.6 GB/s of L1D load BW is far below
> the 250+ GB/s that you can get out of a Zen 2/3 core running above 4 GHz (256-bit AVX loads).


Chester, go back and read the section on the L1 architecture.
The issue is not three loads per cycle; no one doubts that, and it's easily attested.
The issue is the width and structure of the connections between the L1 and the LSU. It's irrelevant that the LSU can issue three 16B-wide loads in a cycle if the L1 can only SERVICE two 16B loads in that cycle.
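
Spelling out the arithmetic (the widths and clock are the same figures quoted above):

    LSU issue rate:   3 loads/cycle x 16B x 3.2GHz = 153.6 GB/s
    L1 service rate:  2 loads/cycle x 16B x 3.2GHz = 102.4 GB/s

A measured sustained bandwidth roughly 10% above the 102.4 GB/s service ceiling is exactly the anomaly my document sets out to explain.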

The very fact that you are discussing this in language relevant to the traditional world (bank conflicts) rather than to Apple's cache tells me you have completely missed the point I am trying to resolve.
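
For anyone who wants to try the measurement Chester describes, here is a minimal sketch of a linear streaming-load loop. It uses NEON intrinsics rather than hand-written ldr q, and the 32KB working set, pass count, x8 unroll, and clock_gettime harness are illustrative choices of mine, not anything taken from my measurement document:

    /* Sketch: stream 128-bit loads over an L1D-resident buffer and
       report sustained bandwidth. Compile with -O2 on AArch64. */
    #include <arm_neon.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define BYTES  (32 * 1024)        /* working set well inside M1's 128KB L1D */
    #define VECS   (BYTES / 16)       /* number of 128-bit loads per pass */
    #define PASSES (1 << 18)

    int main(void) {
        static uint8_t buf[BYTES] __attribute__((aligned(64)));
        /* eight independent accumulators so the XOR dependency chains
           never gate the load issue rate */
        uint8x16_t a0 = vdupq_n_u8(0), a1 = a0, a2 = a0, a3 = a0;
        uint8x16_t a4 = a0, a5 = a0, a6 = a0, a7 = a0;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long p = 0; p < PASSES; p++) {
            const uint8_t *ptr = buf;
            for (int i = 0; i < VECS; i += 8) {   /* unrolled x8 */
                a0 = veorq_u8(a0, vld1q_u8(ptr +   0));
                a1 = veorq_u8(a1, vld1q_u8(ptr +  16));
                a2 = veorq_u8(a2, vld1q_u8(ptr +  32));
                a3 = veorq_u8(a3, vld1q_u8(ptr +  48));
                a4 = veorq_u8(a4, vld1q_u8(ptr +  64));
                a5 = veorq_u8(a5, vld1q_u8(ptr +  80));
                a6 = veorq_u8(a6, vld1q_u8(ptr +  96));
                a7 = veorq_u8(a7, vld1q_u8(ptr + 112));
                ptr += 128;
            }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double gbps = (double)BYTES * (double)PASSES / secs / 1e9;
        /* consume the accumulators so the loads can't be dead-code eliminated */
        uint8x16_t a = veorq_u8(veorq_u8(veorq_u8(a0, a1), veorq_u8(a2, a3)),
                                veorq_u8(veorq_u8(a4, a5), veorq_u8(a6, a7)));
        printf("%.1f GB/s (check %u)\n", gbps, vgetq_lane_u8(a, 0));
        return 0;
    }

On my numbers above, you should expect this to come in a little over 102.4 GB/s, not the 153.6 GB/s that the 3/cycle issue rate would suggest; that gap is the whole point.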
