lmbench is horribly broken

By: Exophase (exophase.delete@this.gmail.com), March 17, 2017 2:31 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on March 17, 2017 1:10 pm wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on March 16, 2017 6:49 pm wrote:
> >
> > They aren't guaranteed to miss the TLB at all, actually. And
> > the patterns when they do miss can be quite interesting.
> > I made my own version that used the same physical page as a backing store exactly because I was looking at
> > those kinds of things and wanted to see what was the D$ effect, and what was the TLB effect.
> Ok, so I recreated a cut-down version of the test, just for fun.
> It is a loop that runs for one second, with the critical sequence essentially
> being (note the "unix syntax": source first, destination second):
> mov (%r14,%rbp,1),%r9d
> add %rbp,%r9
> and %r13,%r9
> mov %r9,%rbp
> which is just loading a relative next-pointer index from memory, with an "add" because
> it's a relative offset, and an "and" to keep the offset inside the buffer.
> It's all the same page mapped over and over again, with the contents of the page being just a repeated
> word with the stride (which you can then set). In other words, it's designed exactly to get the
> worst-case TLB behavior, with a data dependency to make sure you get a real latency.
> The best case latency for my loop looks like 7 cycles on my 6th gen core, and
> I think it ends up being pretty close to the above critical sequence (4-cycle
> load-to-use latency, plus one cycle latency for the chain of add->and->load).
> I verified that by using a 16kB area with a 4-byte stride, which should be pretty much optimal.
> It definitely fits in cache (only 4kB of actual physical memory used: the 16kB area is four
> pages mapped end to end), and it obviously fits in the TLB too (only four pages mapped).
> So I get
> [torvalds@i7 test-tlb]$ ./test-tlb 16k 4
> 556.995678 iterations (~7.0 cycles) in one second
> for that best case (that dot in the output is just to make it visually easier to see the million
> mark: it's about 557 million iterations per second on a CPU that runs at 3.9GHz).
> What happens when I make the stride be 4k, to get maximal TLB use, and then vary the size of the buffer?
> (Note: none of this uses large pages, since it's mapping the same 4kB page over and over again).
> Here's what I get:
> 4k: 557.788382 iterations (~7.0 cycles)
> 8k: 557.948977 iterations (~7.0 cycles)
> 16k: 557.850215 iterations (~7.0 cycles)
> 32k: 557.955472 iterations (~7.0 cycles)
> 64k: 557.869825 iterations (~7.0 cycles)
> 128k: 557.872359 iterations (~7.0 cycles)
> 256k: 521.434123 iterations (~7.5 cycles)
> 512k: 244.176667 iterations (~16.0 cycles)
> 1M: 244.359283 iterations (~16.0 cycles)
> 2M: 244.178326 iterations (~16.0 cycles)
> 4M: 244.013284 iterations (~16.0 cycles)
> 8M: 150.342708 iterations (~25.9 cycles)
> 16M: 149.726365 iterations (~26.0 cycles)
> 32M: 144.814447 iterations (~26.9 cycles)
> 64M: 144.093650 iterations (~27.1 cycles)
> 128M: 142.570533 iterations (~27.4 cycles)
> which is for 1, 2, 4, 8 ... TLB entries in the mapping.
> So you get that perfect thing up to 256kB, which is 64 TLB entries. That sounds right: that
> should be the correct L1 dTLB size on Skylake. Yes, at the end there you can start seeing that
> you have other TLB activity too, so it reports ~7.5 cycles, but it's pretty damn clear.
> So that's when the L2 TLB kicks in, and a TLB miss in the L1 and hit in L2 adds nine cycles.
> Remember: this code is literally designed to test the worst case, and 9 cycles isn't too
> bad. The L2 dTLB on Skylake is reported to be 1536, so you should see something change at
> around the 6M mark. And indeed, look at what happens between the 4M and 8M buffer size.
> Now this is where a good TLB fill comes in. Missing entirely in the TLB adds only another
> 10 cycles. That should make you go "wow". That's 10 extra CPU cycles. TEN.
> In other words, if you actually do memory latency testing, you basically won't
> even see it. It's 3ns. It's not even noticeable in memory latency.
> So Wilco - I'm serious. If you complain that lmbench shows TLB effects, and you
> think that means that lmbench is crap, you really really need to learn better.
> It's not lmbench that is crap. It's your CPU.
> It really is that simple.
> If you want the source-code, email me. It's a quick hack that I wrote up, and you'll have to hardcode your
> frequency number for the "cycles" count, but you can test it out yourself. Maybe you can find a bug in my
> program, but I've seen that "just a handful of cycles" thing before. I just wanted to re-check it.
> And I suspect you might want to ask yourself some serious questions about your ARM
> fanboyism, instead of blaming benchmarks. Maybe that ARM core you like so much wasn't
> doing so well after all, and it really wasn't the benchmark that was broken?
> And that benchmark you liked so much better - maybe it wasn't "better" at all? Maybe it was in fact worse?
> Linus

You're using a unit stride and it's clearly engaging the prefetcher. So you're testing:

cache hit + TLB miss + TLB cache hit

It might not actually be the full TLB miss latency either: it's possible that the prefetch is starting the L2 TLB load early.

While this is what lmbench's unit-stride mode does, it's useless for measuring memory latency, since memory is never even touched. There was a time when you could set the stride large enough to defeat the prefetcher, but that appears to no longer be the case.

So you argue it demonstrates that the following:

LLC miss + TLB miss + TLB cache hit

is going to be close to a memory latency test on a competent uarch, because the TLB miss should be very small compared to memory latency. That's a fair argument. But when Wilco criticizes lmbench, he's probably talking about its thrash_initialize mode, which uses random access instead of strided access, and which most likely results in:

LLC miss + TLB miss + TLB LLC miss

That is NOT close to a memory latency test. That is, like Wilco said, more like 2x a memory latency test.

In order to get something that really only measures one main memory load, you need the TLB data to be in the cache. To do this you want a test that, for example, uses random accesses within a range small enough not to miss the TLB, possibly using large pages. GB4's memory latency test uses such a design.

Now yes, of course you can argue that the double memory access case is useful too, but I suspect that in the real world an LLC miss more commonly hits the TLB, or at least hits the cache when loading the TLB entry, than not. And if you really want to measure TLB miss performance, that's not a great way to do it, since a main memory access will look like a long time even compared to a pretty slow page walker, assuming it actually uses the caches to begin with (and you'll be hard pressed to find modern uarchs with multi-level caches that don't).