Followup for Andrei

By: --- (---.delete@this.redheron.com), November 11, 2021 3:57 pm
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on November 11, 2021 2:27 pm wrote:
> --- (---.delete@this.redheron.com) on November 11, 2021 9:21 am wrote:
> > Andrei F (andrei.delete@this.anandtech.com) on November 11, 2021 1:30 am wrote:
> > > --- (---.delete@this.redheron.com) on November 10, 2021 5:43 pm wrote:
> > > > --- (---.delete@this.redheron.com) on November 10, 2021 9:26 am wrote:
> > > >
> > > > > Thanks for the feedback. The bandwidth stuff is coming together nicely, almost done, and soon I'll have
> > > > > latency (where again I'll validate against your graphs, but again details and goals will differ so sure,
> > > > > one should not expect that our results are identical; just that they show the same sort of thing).
> > > >
> > > > Andrei, I discovered something interesting that you may want to check out within your code!
> > > > If you've been reading along my evolving explanations of what's going on, the following might occur to you:
> > > >
> > > > In a basic reduction (so a stream of loads), think something like
> > > > for(){sum+=array[i];}
> > > > The address stream being presented to the CPU is a linear stream.
> > > > In the LSU this is split into two sub-streams (directed at even and odd cache lines), but these streams
> > > > are still sequential. So even though we now understand that the L1 has, conceptually, three ways to extract
> > > > data from it, that requires us to feed addresses corresponding to three distinct lines to the L1, so right
> > > > now we're only using that third path under the conditions where we're at the end of the hot line and so
> > > > one load comes from the hot line, and the next load in that stream then goes to the cache proper.
> > > > What if, instead, we generate a stream of somewhat interleaved addresses so that, almost
> > > > every cycle, of the three addresses presented to the cache one can hit in the hot line
> > > > and two can hit in the cache proper. Would such a scheme possibly work?
> > > >
> > > > Oh yes it would! In a perfect world this could give us a full 48B loaded per cycle.
> > > > We can't hit exactly that, but we do get as high as 44B/cycle, 140GB/s!
> > > >
> > > > So essentially change any reduction code to something like
> > > > for(){sumA+=a[i]; sumB+=b[i];}
> > > > where a and b are widely separated arrays. In a real reduction, the obvious implementation would
> > > > be to split the array in half, set b[] to the halfway point, and perform the final sumA+sumB
> > > > (or whatever) after the loop. Very easy code-mod, gets you an extra 40% performance.
> > > >
> > > > There remains some weirdness in that the effect is only present for some fraction of the
> > > > cache. I'm not sure quite what the gating factor is but in my particular code I see it
> > > > up to a test depth of 32..64kB, but not beyond that. Still haven't figured that out.
> > > >
> > > > Do you see something like this on M1 or M1P if you mod your code like this? I suspect it's very much
> > > > an Apple-specific thing, that it will buy you nothing anywhere else. I don't know if you still have
> > > > access to an M1 or M1M to test, but you at least want to add a mod like this to your code base.
> > > >
> > > > This also seems sufficiently obscure (though remarkably powerful) that even most of Apple seem unaware
> > > > of it. In particular I would expect that the STL semantics for reduction operations (like accumulate(),
> > > > which I used in one of my tests) allow for arbitrary re-ordering of the operations [and one can argue
> > > > that -ffast-math should allow it for even a basic for loop!], and so you would hope that Apple's STL
> > > > would split the reduction into two half arrays as I described, likewise [ambitiously] the compiler would
> > > > split the loop in two and run the upper and lower halves in parallel. But no such luck.
> > > >
> > > > (Also BTW I implemented my equivalent of CLFlip and see almost exactly the same
> > > > curve as you see out through L2 -- of course beyond that M1M behaves very differently,
> > > > but I see exactly what I would expect for M1's SLC and DRAM.)
> > > >
> > >
> > > I'm just reading into registers and not doing anything with the data, it's not being reduced in my
> > > tests. How large is your stride between a[] and b[] in the memory? If you're somehow able to access
> > > different banks concurrently or something, that might be an explanation. The boost only happening at
> > > lower depths also points to some sort of physical parallelism limitation in the L1D.
>
> Another explanation is when you're reading and writing within a small enough area, BW may be amplified
> by store forwarding. In other words, if a load comes shortly after a write to the same address, the load
> doesn't need a L1D access - it just gets data from the store buffer. I assume you can get very close to
> 48B/cycle load bandwidth if at least one of every three loads gets its data via store forwarding.

Only if you are loading after storing. This is an unrealistic data pattern (it generally corresponds to sub-optimal code), and that particular path, while it exists for correctness, is not useful for performance.
Certainly my code (and I assume Andrei's) is doing
load x@A, store x@B, not store x@A, load x@A
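For concreteness, the split-reduction code-mod described in the quoted text might look something like this. This is only a sketch (the function name and exact splitting are my illustration, not anything from the original tests, and a real measurement harness would use assembly or intrinsics to control what the compiler emits):

```cpp
#include <cstddef>

// Sketch of the code-mod described above: split one reduction into two
// independent accumulators walking the lower and upper halves of the
// array, so each iteration presents loads from two widely separated
// cache lines rather than one sequential stream.
double split_reduce(const double* a, std::size_t n) {
    const double* b = a + n / 2;    // second stream starts at the halfway point
    double sumA = 0.0, sumB = 0.0;
    for (std::size_t i = 0; i < n / 2; ++i) {
        sumA += a[i];               // stream 1: lower half
        sumB += b[i];               // stream 2: upper half
    }
    if (n % 2) sumA += a[n - 1];    // leftover element when n is odd
    return sumA + sumB;             // final combine, after the loop
}
```

Note this is all loads (plus register adds); no store-to-load forwarding is involved, which is the point of the load x@A, store x@B distinction above.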


> Or, if you're only doing loads in a very small area, it's possible two loads going to the same location can both
> be waiting on data from the same address, and a single load result from the L1D can complete both loads.

Very small area here means 32KiB to 64KiB, not one or two cache lines.

> Both of those possibilities stand out to me because M1 AGU throughput (3) and
> L1D load ports (2) are mismatched. If you're looping over a small region, you'll
> inevitably have a lot of loads queued up with known addresses waiting on data.
>
> > The reduction is only to make the test a little more real-world,
> > and to prevent the compiler from optimizing away raw reads.
>
> You don't need to do that if you're using assembly directly.
>
> You do need to sink the result if you're using intrinsics in C, or plain C code. With the former, you
> basically get the same results with increased power draw (actually a problem with some chips like Rocket
> Lake). I don't recommend plain C code because it's very hard to control what the compiler emits.

Oh Chester. You do like to teach the crocodile where the water is, don't you :-)


> > The essential loop is vector load pairs (what we are testing) and vector adds (which we get for free).
> > The part that matters is that we get accesses to different
> > lines, so you want a and b to be at least a line apart.
> > I do not believe this is a bank issue as in the PC world. I've given my long explanation of how
> > the Apple L1 works in my document, and why I believe it has a very different structure (based
> > on experiments, patents, and best-practices in the literature for reducing energy usage).
> > My best guess as to why I see this over a restricted range of the cache (but a range that
> > varies somewhat depending on exactly what I do) is that it has to do with whatever code is
> > tracking the stride access pattern that makes it considered viable to use a hot line.
> > Possibly it's something like with restricted enough access (access limited to only half or a quarter of the
> > L1) the other lines are put to sleep and the hot line mechanism is used; but when we are not in this mode
> > the heuristic is that "cache access is variable enough that hot line will not overall be valuable"?
> >
> > This is still something I need to keep looking at. But I honestly think that
> > considering this as a variant of the traditional PC L1 does not help us.
>
> Banking has nothing to do with PC vs Mac vs whatever. It's a common technique used for decades to service
> multiple accesses per cycle without the area and power expense of truly multi-porting a structure. More
> generally, there's a surprising amount of commonality in techniques used by various different CPU makers.
> Just because something shows up in a patent, doesn't mean it's implemented anywhere.

Seriously dude! Just read my document. Or don't.
But don't repeatedly claim you have read it, and then display complete ignorance of its contents.
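As for the earlier point about accumulate(): since reduction semantics permit arbitrary reordering, a library could legally do the half-array split itself. A hypothetical sketch of what that would look like (split_accumulate is my name for illustration, not anything Apple's STL actually provides):

```cpp
#include <numeric>
#include <vector>

// Hypothetical: a reduction that exploits its freedom to reorder by
// running two independent accumulations over the two halves of the
// range, then combining the partial sums at the end.
double split_accumulate(const std::vector<double>& v) {
    auto mid = v.begin() + v.size() / 2;
    double lo = std::accumulate(v.begin(), mid, 0.0);  // lower half
    double hi = std::accumulate(mid, v.end(), 0.0);    // upper half
    return lo + hi;                                    // final combine
}
```

As written this still interleaves nothing (the two std::accumulate calls run one after the other); the compiler or library would have to fuse the two half-range loops into one interleaved loop to get the two-stream address pattern.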