Followup for Andrei

By: Andrei F, November 11, 2021 2:30 am
Room: Moderated Discussions
--- on November 10, 2021 5:43 pm wrote:
> --- on November 10, 2021 9:26 am wrote:
> > Thanks for the feedback. The bandwidth stuff is coming together nicely, almost done, and soon I'll have
> > latency (where again I'll validate against your graphs, but again details and goals will differ so sure,
> > one should not expect that our results are identical; just that they show the same sort of thing).
> Andrei, I discovered something interesting that you may want to check out within your code!
> If you've been reading along my evolving explanations of what's going on, the following might occur to you:
> In a basic reduction (so a stream of loads), think something like
> for(){sum+=array[i];}
> The address stream being presented to the CPU is a linear stream.
> In the LSU this is split into two sub-streams (directed at even and odd cache lines), but these streams
> are still sequential. So even though we now understand that the L1 has, conceptually, three ways to extract
> data from it, that requires us to feed addresses corresponding to three distinct lines to the L1, so right
> now we're only using that third path under the conditions where we're at the end of the hot line and so
> one load comes from the hot line, and the next load in that stream then goes to the cache proper.
> What if, instead, we generate a stream of somewhat interleaved addresses, so that on almost
> every cycle, of the three addresses presented to the cache, one can hit in the hot line
> and two can hit in the cache proper. Would such a scheme possibly work?
> Oh yes it would! In a perfect world this could give us a full 48B loaded per cycle.
> We can't hit exactly that, but we do get as high as 44B/cycle, 140GB/s!
> So essentially change any reduction code to something like
> for(){sumA+=a[i]; sumB+=b[i];}
> where a and b are widely separated arrays. In a real reduction, the obvious implementation would
> be to split the array in half, set b[] to the halfway point, and perform the final sumA+sumB
> (or whatever) after the loop. Very easy code-mod, gets you an extra 40% performance.
> There remains some weirdness in that the effect is only present for some fraction of the
> cache. I'm not sure quite what the gating factor is but in my particular code I see it
> up to a test depth of 32..64kB, but not beyond that. Still haven't figured that out.
> Do you see something like this on M1 or M1P if you mod your code like this? I suspect it's very much
> an Apple-specific thing, that it will buy you nothing anywhere else. I don't know if you still have
> access to an M1 or M1M to test, but you at least want to add a mod like this to your code base.
> This also seems sufficiently obscure (though remarkably powerful) that even most of Apple seem unaware
> of it. In particular I would expect that the STL semantics for reduction operations (like accumulate(),
> which I used in one of my tests) allow for arbitrary re-ordering of the operations [and one can argue
> that -ffast-math should allow it for even a basic for loop!], and so you would hope that Apple's STL
> would split the reduction into two half arrays as I described, likewise [ambitiously] the compiler would
> split the loop in two and run the upper and lower halves in parallel. But no such luck.
> (Also BTW I implemented my equivalent of CLFlip and see almost exactly the same
> curve as you see out through L2 -- of course beyond that M1M behaves very differently,
> but I see exactly what I would expect for M1's SLC and DRAM.)

I'm just reading into registers and not doing anything with the data; it's not being reduced in my tests. How large is your stride between a[] and b[] in memory? If you're somehow able to access different banks concurrently, or something like that, it might be an explanation. The boost only appearing at lower depths also points to some sort of physical-parallelism limitation in the L1D.