less painful, and maybe even more effective, options

By: Travis (travis.downs.delete@this.gmail.com),
Room: Moderated Discussions
Jeff S. (fakity.delete@this.fake.com) on August 4, 2018 1:46 pm wrote:
> Adrian (a.delete@this.acm.org) on August 3, 2018 7:08 am wrote:
> > Travis (travis.downs.delete@this.gmail.com) on August 2, 2018 11:06 am wrote:
> > > BTW, if anyone has any recommendation on how to boot into a really "quiet" system for
> > > benchmarking, I'd be happy to hear it. Certainly it should be non-GUI, but that's probably
> > > not enough on a mainstream distro since there will still be a lot of background services,
> > > etc, running. Maybe just a cut down distro that defaults to everything off?
> > I suppose that rebooting in single user mode is the best you can do without using some special kernel.
> > On the kinds of operating systems that I am using, this is done with "init
> > 1". On systems with systemd, that is more complex. You should study your local
>
> You can go indefinitely down the rabbit hole, but quite a lot can be accomplished with:
> - disabling C-states in BIOS to keep your system from trying to flush and sleep its caches
> - booting with the 'isolcpus' kernel parameter to keep some
> set of cores away from the dynamic scheduler entirely
> - booting with tickless/'nohz_full' kernel parameter to get unbounded time quanta on threads given their
> own cores
> - disabling the irqbalance service and pinning IRQ handling to a "dirty" core
> - booting with 'rcu_nocbs' and 'rcu_nocb_poll' to similarly redirect RCU spam to a dirty core

I have already done most of these things. They matter for "on-core noise" (i.e., they address things that affect even core-bound code), but I've got that close to zero anyway. In particular, if you run short tests, 99% of the time you won't get interrupted at all, and you can detect the cases where you were and throw them out (although it seems you can't detect SMIs except indirectly).
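
To make the "detect and throw out" idea concrete, here is a rough sketch of one way to do it: time many short trials and drop any that land far above the median. The function names and the 1.5x cutoff are my own assumptions for illustration, not a tuned or standard value, and this is not necessarily how any particular benchmark harness does it.

```python
import statistics
import time


def timed_trials(fn, trials=1000):
    """Time fn() `trials` times and return the durations in nanoseconds."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return samples


def drop_interrupted(samples, tolerance=1.5):
    """Keep samples at or below `tolerance` times the median.

    Assumed heuristic: a trial that was interrupted is expected to be much
    slower than a clean one, so it lands well above the median.
    """
    cutoff = statistics.median(samples) * tolerance
    return [s for s in samples if s <= cutoff]
```

You would call something like `drop_interrupted(timed_trials(my_test))` and report the minimum or median of what is left. It only works if clean trials are the clear majority, which is the case for short tests.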

So I am good in that respect. I'm looking for an approach that reduces noise on the shared part of the system for tests that stress L3 or memory.

Two notes on your list:

1) I didn't find isolcpus to really make any difference in practice for benchmarks. If you have enough cores and pin your task, the scheduler will never interrupt it anyway, at least to a first-order approximation. I used it for a while, but never really saw any difference and removed it.
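
For reference, pinning without isolcpus can be as simple as the sketch below. It uses Python's `os.sched_setaffinity`, which is only available on Linux (and a few other platforms); the `pin_to_cpu` helper and the choice of the highest-numbered CPU as a default are just assumptions for the example.

```python
import os


def pin_to_cpu(cpu=None):
    """Pin the caller (pid 0) to a single CPU and return the resulting CPU set.

    If `cpu` is None, the highest-numbered CPU currently allowed is used,
    which is an arbitrary choice for this sketch.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("CPU affinity calls are not available on this platform")
    allowed = os.sched_getaffinity(0)
    if cpu is None:
        cpu = max(allowed)
    if cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)
```

Calling `pin_to_cpu(3)` before the timed region keeps the benchmark on CPU 3; it does nothing to keep other tasks off that CPU, which is the part isolcpus would add.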

2) I found nohz_full to be *worse* than without for my use-case. Now either on or off is fine for small benchmarks as described above, but I saw more variance with nohz_full for longer benchmarks.

Eventually I traced it down to this[1]: even with nohz_full, one CPU still has to take the timer tick periodically so the kernel can do its periodic housekeeping. With the normal approach (without nohz_full), it is your benchmark process (the only running process) that will usually get the timer tick, since it is on the active CPU (no point in waking up a different, sleeping CPU). This adds some variance, but only as much as the timer tick routine itself, which is fairly efficient (maybe worse now with Meltdown/Spectre; this was before all that).

If you are using nohz_full, however, it is some other CPU that is woken up (assuming you set the CPU you are benchmarking on as "adaptive tick" to get the full benefit of NOHZ). You'd think that would be better than taking the tick on the benchmark CPU, but it was worse: the second CPU spinning up caused a turbo transition, which takes tens of microseconds, and then the benchmark CPU runs 200 MHz slower for the duration of the interrupt (which probably takes much longer than it would on the current CPU, because some cached data may have been lost during sleep, etc).
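
If you want to see which CPU is actually taking the ticks on your own box, one option is to compare two snapshots of the "LOC" (local timer interrupts) row in /proc/interrupts. Below is a minimal sketch; the helper names are made up for this post, and it assumes the usual Linux layout where the first line lists the online CPUs as CPU0, CPU1, and so on.

```python
import time


def local_timer_counts(interrupts_text):
    """Return {cpu_number: count} from the LOC row of /proc/interrupts text."""
    lines = interrupts_text.splitlines()
    # Header row names the online CPUs, e.g. "CPU0 CPU1 CPU3".
    cpus = [int(name[3:]) for name in lines[0].split()]
    for line in lines[1:]:
        fields = line.split()
        if fields and fields[0] == "LOC:":
            counts = fields[1:1 + len(cpus)]
            return {cpu: int(n) for cpu, n in zip(cpus, counts)}
    return {}


def tick_deltas(before, after):
    """Per-CPU increase in local timer interrupts between two snapshots."""
    return {cpu: after[cpu] - before.get(cpu, 0) for cpu in after}


def sample_tick_deltas(seconds=1.0, path="/proc/interrupts"):
    """Snapshot the LOC row, wait `seconds`, snapshot again, return the deltas."""
    with open(path) as f:
        before = local_timer_counts(f.read())
    time.sleep(seconds)
    with open(path) as f:
        after = local_timer_counts(f.read())
    return tick_deltas(before, after)
```

Run `sample_tick_deltas()` while the benchmark is going: without nohz_full the benchmark CPU should show roughly one tick per timer period, while with nohz_full most of the counts should show up on a housekeeping CPU instead.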

So with turbo on, it was actually faster and lower-variance just to take the interrupts on the benchmark core than to deal with the turbo ratio gyrating around all the time. With turbo off it might be different, but in this case I wanted to run with turbo on.




[1] (this is from memory so should be considered approximate)