Please don't use the MT graphs

By: --- (---.delete@this.redheron.com), November 10, 2021 10:26 am
Room: Moderated Discussions
Andrei F (andrei.delete@this.anandtech.com) on November 10, 2021 2:13 am wrote:
> --- (---.delete@this.redheron.com) on November 9, 2021 9:31 pm wrote:
> > Ganon (anon.delete@this.gmail.com) on November 9, 2021 7:02 pm wrote:
> > > --- (---.delete@this.redheron.com) on November 9, 2021 1:39 pm wrote:
> > > > I have no idea how many people here read my (ongoing, the public version is only
> > > > version 0.7) exegesis of M1 internals. But those who have read the entire thing
> > > > (all 300+ pages!) will remember an on-going bafflement regarding the L1 cache.
> > > >
> > >
> > >
> > > Thoroughly enjoyed the read; looking forward to the next update. Regarding
> > > m1 pro/max; seems some things have changed at least according to
> > >
> > > https://www.anandtech.com/show/17024/apple-m1-max-performance-review/2
> > >
> > > where a single core has >100GB/s all the way from L1 to DRAM; even better
> > > than M1.
> >
> >
> > That graph seems to be measuring something different from the first Anandtech graph, the graph for the M1.
> > The M1 graph was, as far as I can tell, for pure *load*
> > performance (at least that's my case that it matches
> > most closely), and that's at least the most obvious case when you look at the Intel graphs I referenced.
> >
>
> Both the M1 and M1 Max are the same test.
>
> Please don't use my MT graphs for detailed analysis in the L1, it's a multi-threaded
> test and has overhead at small depths and you can't get detailed data there.
>
> https://i.imgur.com/dL7s5vX.png

For *my* analysis, my concern is to see that I match you, as a sanity check that my code is correct, before I start trying to understand the results.
Essentially I see the same thing.
I am using ldp/stp of vectors rather than just ldr/str, which may be necessary to get that last 10% boost in the L1. And of course I don't have any threading overhead, which may also help.
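
To make concrete what I mean by ldp/stp of vectors (this is just an illustrative sketch of the idea, not my actual harness; the function name and unroll factor are mine):

    #include <stdint.h>
    #include <stddef.h>

    // Illustrative only: stream loads through a buffer with paired 128-bit
    // vector loads, so each ldp moves 32 bytes versus 16 for a single ldr.
    // Assumes bytes is a multiple of 128 (the unroll below covers 128B).
    static void load_sweep_ldp(const uint8_t *buf, size_t bytes, size_t iters) {
        for (size_t it = 0; it < iters; it++) {
            for (const uint8_t *p = buf; p < buf + bytes; p += 128) {
                __asm__ volatile(
                    "ldp q0, q1, [%0]      \n\t"   // bytes 0..31
                    "ldp q2, q3, [%0, #32] \n\t"   // bytes 32..63
                    "ldp q4, q5, [%0, #64] \n\t"   // bytes 64..95
                    "ldp q6, q7, [%0, #96] \n\t"   // bytes 96..127
                    :
                    : "r"(p)
                    : "v0","v1","v2","v3","v4","v5","v6","v7","memory");
            }
        }
    }

The stp version is the obvious mirror image.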

I see the same phenomenon that you see for stores, as I pointed out: the overhead that costs loads about 20% of their bandwidth going from L1 to L2 is absent for stores, so you can stream stores to L2 at essentially the same rate as to L1.
I also see the phenomenon you see where loads and stores within the same line (CLflip) do a whole lot better.
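
For reference, the kind of kernel I use to reproduce that within-a-line behaviour looks roughly like this; it is my reading of your CLflip description ("flip 8B elements within 64B chunks around"), so treat it as an assumption about what your test does rather than a statement of it:

    #include <stdint.h>
    #include <stddef.h>

    // Within each 64-byte chunk, reverse the eight 8-byte elements in place,
    // so every cache line is both read and written before we move on.
    static void clflip_sketch(uint64_t *buf, size_t bytes) {
        size_t lines = bytes / 64;            // one 64B chunk per cache line
        for (size_t i = 0; i < lines; i++) {
            uint64_t *line = buf + i * 8;     // eight 8-byte elements
            for (size_t j = 0; j < 4; j++) {
                uint64_t tmp = line[j];       // swap element j with 7-j
                line[j] = line[7 - j];
                line[7 - j] = tmp;
            }
        }
    }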

One phenomenon I discovered (I don't know if you correct for this, but it may explain the weirder parts of your curves, like the store dip from 64 to 1024, or the oscillations in Flip) is that the self-tuning heuristics (which slow down either the CPU or the DRAM in the face of less 'demand'), while probably appropriate for most code, get confused by code like this!
I saw some of the sort of thing you are seeing, and fought it by re-ordering the tests, running them multiple times, and choosing the best of the selection. So, for example, I have some load tests that look like yours with the middle dip, but rearrange that test to run before the load test and you can get a curve that's flat at 100GB/s from 0 out to 10+MiB. It obviously depends (I don't know how far the history goes back) on what has been happening earlier, and it can also move depending on how many iterations you give each test. Ah, the joys of benchmarking.
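
In case it is useful, the shape of that "re-order, repeat, keep the best" workaround is roughly this (illustrative names only, not my real harness):

    #include <stddef.h>

    // Run each bandwidth test several times, rotating which test goes first
    // on each pass, and keep the best GB/s seen per test, so whatever
    // throttling state the previous test left behind is less likely to
    // taint the reported number.
    typedef double (*bw_test_fn)(size_t bytes);   // returns measured GB/s

    static void best_of_rotated(bw_test_fn tests[], double best[],
                                int ntests, size_t bytes, int passes) {
        for (int t = 0; t < ntests; t++) best[t] = 0.0;
        for (int p = 0; p < passes; p++) {
            for (int i = 0; i < ntests; i++) {
                int t = (i + p) % ntests;          // rotate the order each pass
                double gbs = tests[t](bytes);
                if (gbs > best[t]) best[t] = gbs;  // keep the best observation
            }
        }
    }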

(Note that this is auto-tuning at the CPU level at a very high frequency. The DRAM does it by dropping the DRAM frequency by exactly a factor of 2, which allows very rapid toggling. But that one's not relevant here.
The CPU does it, I think, by halting clocks and so throttling how fast the LSU can operate, and I think that's where the problematic heuristics come in:
- prior code has been hitting the DRAM hard
- the heuristics assume it makes sense to throttle the CPU (since for "normal" code you might as well save energy while you are mostly waiting on DRAM)
- and the heuristics take long enough to make their decision and kick in that we're already onto the in-cache testing of the next test.)

I agree this seems to imply a disturbingly long gap between when the memory-accessing code was hot and when the CPU slows down, but maybe for realistic code that sort of delay is reasonable? I found that a long run of stores gave the most variable-looking curves, but I also hit a case where code that was a mix of loads and stores (essentially something like
vecD = vecA op vecB op vecC, so three loads to a store) managed to drop to a datapoint of 12GB/s, which I assume was some pessimal combination of the DRAM being slowed down, then the CPU being slowed down, and both deciding napping like that was fun and neither urging the other to wake up!
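
The kernel in question was essentially of this shape (a sketch only; the actual "op" doesn't matter, what matters is the 3:1 load-to-store ratio):

    #include <stddef.h>
    #include <arm_neon.h>

    // Three streamed loads feeding one streamed store per iteration.
    // Assumes n is a multiple of 4 floats.
    static void triad3_sketch(float *d, const float *a, const float *b,
                              const float *c, size_t n) {
        for (size_t i = 0; i < n; i += 4) {
            float32x4_t va = vld1q_f32(a + i);    // load 1
            float32x4_t vb = vld1q_f32(b + i);    // load 2
            float32x4_t vc = vld1q_f32(c + i);    // load 3
            vst1q_f32(d + i, vaddq_f32(vaddq_f32(va, vb), vc));  // one store
        }
    }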

If (of course who knows right now) Thread Director is purely informational, then presumably we will not see this sort of thing in ADL, since I think at the lower frequency at which the OS operates it would not make decisions like this?
But honestly who knows? Maybe the HW is blameless here and it is indeed, in both cases, the OS moving DVFS up and down? That just seems like it would be sufficiently disruptive that you would really see it in the graphs? But maybe not when a DVFS glitch is swamped in 1000 iterations of a loop?

> That would be a more accurate showcase on the M1 Max for example.
>
> The test works in 64B chunks, doesn't matter for pure LD or pure ST,
>
> Flip = copy/alter from one region to another (flip the memory in 8B elements around)
> CLflip = flip 8B elements within 64B chunks around, read and write to same cachelines
>
> In terms of the store bandwidth, you can't properly measure it as the fabric transforms write into non-temporal
> ones. We've had this before with >A76 cores where the store bandwidth is almost 100% of theoretical.

Again I care about your graphs only for validating my code! In particular I don't care about how people like McCalpin have defined a "store bandwidth", or about claims that you can't measure it properly on decent ARM cores; my interest is not in "what are the McCalpin-equivalent numbers" but in "what we can say about the machine given the graphs we see".

Thanks for the feedback. The bandwidth stuff is coming together nicely, almost done, and soon I'll have latency (where again I'll validate against your graphs, but again details and goals will differ, so one should not expect that our results are identical; just that they show the same sort of thing).