Real World Technologies - Forums - Thread: Weird L2 latency effect

By: Travis (travis.downs.delete@this.gmail.com), 2018-08-04 21:10 UTC

Jeff S. (fakity.delete@this.fake.com) on August 4, 2018 1:46 pm wrote:
> Adrian (a.delete@this.acm.org) on August 3, 2018 7:08 am wrote:
> > Travis (travis.downs.delete@this.gmail.com) on August 2, 2018 11:06 am wrote:
> > > BTW, if anyone has any recommendation on how to boot into a really "quiet" system for
> > > benchmarking, I'd be happy to hear it. Certainly it should be non-GUI, but that's probably
> > > not enough on a mainstream distro since there will still be a lot of background services,
> > > etc, running. Maybe just a cut down distro that defaults to everything off?
> > I suppose that rebooting in single user mode is the best you can do without using some special kernel.
> > On the kinds of operating systems that I am using, this is done with "init
> > 1". On systems with systemd, that is more complex. You should study your local
>
> You can go indefinitely down the rabbit hole, but quite a lot can be accomplished with:
> - disabling C-states in BIOS to keep your system from trying to flush and sleep its caches
> - booting with the 'isolcpus' kernel parameter to keep some
> set of cores away from the dynamic scheduler entirely
> - booting with tickless/'nohz_full' kernel parameter to get unbounded time quanta on threads given their
> own cores
> - disabling the irqbalance service and pinning IRQ handling to a "dirty" core
> - booting with 'rcu_nocbs' and 'rcu_nocb_poll' to similarly redirect RCU spam to a dirty core

I have already done most of these things. They are important for "on-core noise" (i.e., they solve things that affect even core-bound code), but I've got that close to zero anyway. In particular, if you run short tests, 99% of the time you won't get interrupted at all, and you can detect the cases where you were and throw them out (although it seems you can't detect SMIs except indirectly).
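A minimal sketch of the "detect and throw out" step, assuming an interrupted run shows up as a timing well above the fastest one. The function name `discard_interrupted`, the 5% tolerance, and the sample timings are all hypothetical; the post does not say how the detection was actually done.

```python
import statistics


def discard_interrupted(samples_ns, tolerance=0.05):
    """Keep only samples within `tolerance` (a fraction) of the fastest run.

    Assumption: a run that was interrupted is noticeably slower than the
    fastest run, so anything above min * (1 + tolerance) is dropped.
    """
    if not samples_ns:
        return []
    limit = min(samples_ns) * (1 + tolerance)
    return [s for s in samples_ns if s <= limit]


# Hypothetical timings in nanoseconds; the 5400 ns run stands in for
# a run that was hit by an interrupt.
samples = [1000, 1003, 998, 5400, 1001]
clean = discard_interrupted(samples)
print(clean, statistics.median(clean))
```

This only works for short tests, where most runs finish without being interrupted and the minimum is a reliable baseline.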

So I am good in that respect. I'm looking for an approach that reduces noise on the shared part of the system for tests that stress L3 or memory.

Two notes on your list:

1) I didn't find isolcpus to really make any difference in practice for benchmarks. If you have enough cores and pin your task, the scheduler will never interrupt it anyway, at least to a first-order approximation. I used it for a while, but never really saw any difference and removed it.
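For reference, pinning without isolcpus can be sketched like this on Linux, using Python's `os.sched_setaffinity`. The helper name `pin_current_process` and the choice of the highest-numbered allowed CPU are assumptions for illustration, not what was used for the benchmarks in this thread.

```python
import os


def pin_current_process(cpu=None):
    """Pin the calling process to a single CPU and return its number.

    Returns None on platforms without os.sched_setaffinity (non-Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)
    if cpu is None:
        # Arbitrary choice for the sketch: the highest-numbered allowed CPU.
        cpu = max(allowed)
    elif cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu})
    return cpu


pinned = pin_current_process()
print("pinned to CPU", pinned)
```

The same effect is available from the shell with `taskset`; either way the scheduler is still free to run other tasks on that CPU, which is the part isolcpus would prevent.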

2) I found nohz_full to be *worse* than running without it for my use case. Either setting is fine for short benchmarks as described above, but I saw more variance with nohz_full for longer benchmarks.

Eventually I traced it down to this[1]: even with nohz_full, one CPU has to receive the timer tick periodically to handle whatever housekeeping the kernel must do periodically no matter what. With the normal approach (without nohz_full), it is your benchmark process (the only running process) that will usually get the timer tick, since it is on the active CPU (there's no point in waking up a different, sleeping CPU). This adds some variance, but only as much as the timer tick routine costs, and that routine is fairly efficient (maybe worse now with Meltdown/Spectre mitigations; this was before all that).
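One way to see which CPU is taking the ticks is to compare the per-CPU local timer counts before and after a run. The sketch below assumes x86 Linux, where `/proc/interrupts` has a `LOC:` row ("Local timer interrupts"). The helper names and the sample text are hypothetical, and this is not necessarily how the effect was traced originally.

```python
def local_timer_counts(interrupts_text):
    """Return the per-CPU counts from the 'LOC:' row of /proc/interrupts text."""
    lines = interrupts_text.splitlines()
    if not lines:
        return []
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        fields = line.split()
        if fields and fields[0] == "LOC:":
            return [int(x) for x in fields[1:1 + ncpus]]
    return []


def tick_delta(before_text, after_text):
    """Per-CPU number of local timer interrupts between two snapshots."""
    before = local_timer_counts(before_text)
    after = local_timer_counts(after_text)
    return [a - b for b, a in zip(before, after)]


# Hypothetical snapshots; on a real system read them with
# open("/proc/interrupts").read() before and after the benchmark.
before = "           CPU0       CPU1\nLOC:       1000       2000   Local timer interrupts\n"
after = "           CPU0       CPU1\nLOC:       1004       2250   Local timer interrupts\n"
print(tick_delta(before, after))
```

In the made-up snapshots, CPU1 took nearly all of the ticks during the interval, which is the pattern you would expect from the CPU running the benchmark without nohz_full.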

If you are using nohz_full, however, it is some *other* CPU that is woken up (assuming you set the CPU you are benchmarking on as "adaptive tick" to get the full benefit of NOHZ). You'd think that would be better than taking the tick on the benchmark CPU, but it was worse: the second CPU waking up caused a turbo transition, which takes tens of microseconds, and then it causes the CPU to run 200 MHz slower for the duration of the interrupt (which probably takes much longer than it would on the current CPU, because some cached data may have been lost in sleep, etc.).

So with turbo on, it was actually faster and lower-variance just to take the interrupts on the benchmark core than to deal with the turbo ratio gyrating around all the time. With turbo off it might be different, but in this case I wanted to run with turbo on.
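When comparing the two setups, it helps to confirm which CPUs the kernel actually put in adaptive-tick mode. A rough sketch, assuming a Linux kernel that exposes `/sys/devices/system/cpu/nohz_full`; the helper names are hypothetical, and only the simple `a-b,c` list form is handled.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '1-3,5' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part or part == "(null)":  # empty list on some kernels
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def adaptive_tick_cpus(path="/sys/devices/system/cpu/nohz_full"):
    """CPUs reported as nohz_full; empty if the file is missing or unparsable."""
    try:
        return parse_cpu_list(Path(path).read_text())
    except (OSError, ValueError):
        return []


print("adaptive-tick CPUs:", adaptive_tick_cpus())
print(parse_cpu_list("1-3,5"))
```

An empty result means no CPU is in adaptive-tick mode, i.e. the "normal approach" described above.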

[1] (this is from memory so should be considered approximate)




less painful, and maybe even more effective, options
