lmbench is horribly broken

By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), March 17, 2017 1:10 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on March 16, 2017 6:49 pm wrote:
>
> They aren't guaranteed to miss the TLB at all, actually. And the patterns when they do miss can be quite interesting.
> I made my own version that used the same physical page as a backing store exactly because I was looking at
> those kinds of things and wanted to see what was the D$ effect, and what was the TLB effect.

Ok, so I recreated a cut-down version of the test, just for fun.

It is a loop that runs for one second, with the critical sequence essentially being (note the "unix syntax": source first, destination second):

mov (%r14,%rbp,1),%r9d
add %rbp,%r9
and %r13,%r9
mov %r9,%rbp

which is just loading a relative next-pointer index from memory, with an "add" because it's a relative offset, and an "and" to keep the offset inside the buffer.

It's all the same page mapped over and over again, with the contents of the page being just a repeated word with the stride (which you can then set). In other words, it's designed exactly to get the worst-case TLB behavior, with a data dependency to make sure you get a real latency.
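For readers who want to try something similar, here is a minimal sketch in C of that kind of setup. It is not the program from this post: the helper names `map_same_page` and `chase` are hypothetical, and backing the page with a temp file plus `mmap(MAP_SHARED | MAP_FIXED)` is an assumption about one way to alias a single physical page.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map one file-backed page `npages` times, end to end,
 * and fill it with `stride` repeated as 32-bit words. NULL on failure. */
static char *map_same_page(size_t npages, size_t page, uint32_t stride)
{
    FILE *f = tmpfile();                 /* assumption: temp file as backing store */
    if (!f)
        return NULL;
    int fd = fileno(f);
    if (ftruncate(fd, (off_t)page) != 0) {
        fclose(f);
        return NULL;
    }

    /* Reserve a contiguous virtual range, then overlay it page by page. */
    char *base = mmap(NULL, npages * page, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        fclose(f);
        return NULL;
    }
    for (size_t i = 0; i < npages; i++) {
        void *p = mmap(base + i * page, page, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_FIXED, fd, 0);
        if (p == MAP_FAILED) {
            munmap(base, npages * page);
            fclose(f);
            return NULL;
        }
    }
    fclose(f);                           /* existing mappings stay valid */

    /* Writing through the first mapping fills every alias of the page. */
    uint32_t *words = (uint32_t *)base;
    for (size_t i = 0; i < page / sizeof(uint32_t); i++)
        words[i] = stride;
    return base;
}

/* Dependent-load chain: load a relative offset, add it, wrap with the mask.
 * `mask` must be (buffer size - 1) with a power-of-two buffer size. */
static uint64_t chase(const char *buf, uint64_t mask, uint64_t iters)
{
    assert((mask & (mask + 1)) == 0);
    uint64_t off = 0;
    while (iters--) {
        uint32_t next = *(const volatile uint32_t *)(buf + off);
        off = (off + next) & mask;
    }
    return off;
}
```

Timing a large number of `chase` iterations and dividing by the count gives a per-iteration latency. Because each load address depends on the previous load, the loads cannot overlap.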

The best case latency for my loop looks like 7 cycles on my 6th gen core, and I think it ends up being pretty close to the above critical sequence (4-cycle load-to-use latency, plus one cycle of latency for each step in the chain of add->and->load).

I verified that by using a 16kB area with a 4-byte stride, which should be pretty much optimal. It definitely fits in cache (only 4kB of actual physical memory used: the 16kB area is four pages mapped end to end), and it obviously fits in the TLB too (only four pages mapped).

So I get

[torvalds@i7 test-tlb]$ ./test-tlb 16k 4
556.995678 iterations (~7.0 cycles) in one second

for that best case (that dot in the output is just to make it visually easier to see the million mark: it's about 557 million iterations per second on a CPU that runs at 3.9GHz).
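The "~7.0 cycles" figure is just the clock frequency divided by the iteration rate. A small sketch of that arithmetic in C, where the function name `cycles_per_iteration` is made up for illustration and the 3.9GHz clock is the hardcoded number quoted above:

```c
#include <assert.h>

/* Cycles spent per loop iteration, given the CPU clock and the measured
 * number of iterations completed in one second. */
static double cycles_per_iteration(double freq_hz, double iters_per_sec)
{
    assert(freq_hz > 0.0 && iters_per_sec > 0.0);
    return freq_hz / iters_per_sec;
}
```

With `cycles_per_iteration(3.9e9, 556995678.0)` this comes out at roughly 7.0. The result is only as good as the hardcoded frequency, so it is off if the core is not actually running at that clock.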

What happens when I make the stride be 4k, to get maximal TLB use, and then vary the size of the buffer? (Note: none of this uses large pages, since it's mapping the same 4kB page over and over again).

Here's what I get:

4k: 557.788382 iterations (~7.0 cycles)
8k: 557.948977 iterations (~7.0 cycles)
16k: 557.850215 iterations (~7.0 cycles)
32k: 557.955472 iterations (~7.0 cycles)
64k: 557.869825 iterations (~7.0 cycles)
128k: 557.872359 iterations (~7.0 cycles)
256k: 521.434123 iterations (~7.5 cycles)
512k: 244.176667 iterations (~16.0 cycles)
1M: 244.359283 iterations (~16.0 cycles)
2M: 244.178326 iterations (~16.0 cycles)
4M: 244.013284 iterations (~16.0 cycles)
8M: 150.342708 iterations (~25.9 cycles)
16M: 149.726365 iterations (~26.0 cycles)
32M: 144.814447 iterations (~26.9 cycles)
64M: 144.093650 iterations (~27.1 cycles)
128M: 142.570533 iterations (~27.4 cycles)

which is for 1, 2, 4, 8 ... TLB entries in the mapping.

So you get that perfect thing up to 256kB, which is 64 TLB entries. That sounds right: that should be the correct L1 dTLB size on Skylake. Yes, at the end there you can start seeing that you have other TLB activity too, so it reports ~7.5 cycles, but it's pretty damn clear.

So that's when the L2 TLB kicks in, and a TLB miss in the L1 and hit in L2 adds nine cycles. Remember: this code is literally designed to test the worst case, and 9 cycles isn't too bad. The L2 dTLB on Skylake is reported to be 1536 entries, so you should see something change at around the 6M mark. And indeed, look at what happens between the 4M and 8M buffer size.
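The 256kB and 6M break points follow from entry count times page size. A rough sketch of that calculation in C, with hypothetical helper names and the 4kB page size used throughout this test:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of address space a TLB level can cover with the given page size. */
static size_t tlb_reach_bytes(size_t entries, size_t page_bytes)
{
    assert(page_bytes > 0);
    return entries * page_bytes;
}

/* Distinct pages touched when walking `buf_bytes` with a fixed stride.
 * A stride of a page or more lands on a different page every access. */
static size_t pages_touched(size_t buf_bytes, size_t stride, size_t page_bytes)
{
    assert(stride > 0 && page_bytes > 0);
    size_t step = stride > page_bytes ? stride : page_bytes;
    return (buf_bytes + step - 1) / step;
}
```

For example, `tlb_reach_bytes(64, 4096)` gives 262144 bytes (256kB) and `tlb_reach_bytes(1536, 4096)` gives 6291456 bytes (6M), matching the two points where the numbers change.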

Now this is where a good TLB fill comes in. Missing entirely in the TLB adds only another 10 cycles. That should make you go "wow". That's 10 extra CPU cycles. TEN.

In other words, if you actually do memory latency testing, you basically won't even see it. It's 3ns. It's not even noticeable in memory latency.
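The "3ns" is the 10-cycle penalty converted to time at the 3.9GHz clock mentioned earlier. A minimal sketch of the conversion in C (the function name `cycles_to_ns` is made up for illustration):

```c
#include <assert.h>

/* Convert a cycle count to nanoseconds at a given clock frequency. */
static double cycles_to_ns(double cycles, double freq_hz)
{
    assert(freq_hz > 0.0);
    return cycles / freq_hz * 1e9;
}
```

Taking the step from ~16.0 to ~26.0 cycles in the table as the full-miss penalty, `cycles_to_ns(26.0 - 16.0, 3.9e9)` is about 2.6ns, which rounds to the 3ns quoted.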

So Wilco - I'm serious. If you complain that lmbench shows TLB effects, and you think that means that lmbench is crap, you really really need to learn better.

It's not lmbench that is crap. It's your CPU.

It really is that simple.

If you want the source code, email me. It's a quick hack that I wrote up, and you'll have to hardcode your frequency number for the "cycles" count, but you can test it out yourself. Maybe you can find a bug in my program, but I've seen that "just a handful of cycles" thing before. I just wanted to re-check it.

And I suspect you might want to ask yourself some serious questions about your ARM fanboyism, instead of blaming benchmarks. Maybe that ARM core you like so much wasn't doing so well after all, and it really wasn't the benchmark that was broken?

And that benchmark you liked so much better - maybe it wasn't "better" at all? Maybe it was in fact worse?

Linus