Real World Technologies - Forums - Thread: Weird L2 latency effect

By: Travis (travis.downs.delete@this.gmail.com), 2018-08-02 05:01 UTC

Consider the following simple loop:



.top:
    mov rsi, [rsi]   ; pointer-chasing load: each iteration depends on the previous one

    dec rdi          ; the loop overhead doesn't contribute to the runtime
    jnz .top

This is just the bare-bones classic pointer-chasing loop. If the pointers have been set up to traverse an L2-sized region, you get pretty much exactly 12.0 cycles per iteration on modern Intel, as expected: that's the L2 latency.
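The post doesn't show how the pointers were set up, so the following C is only a sketch of one plausible way to do it. The 64-byte line size, the 128 KiB region size (assumed to fit in L2 but not in L1D), the random ordering, and the names build_chain and chase are all assumptions for illustration, not the original benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define LINE_BYTES   64u              /* assumed cache-line size */
#define REGION_BYTES (128u * 1024u)   /* assumed: fits in L2, exceeds L1D */
#define NUM_LINES    (REGION_BYTES / LINE_BYTES)

/* Link one pointer slot per cache line into a single random cycle.
 * Sattolo's algorithm yields a permutation with exactly one cycle,
 * so the chase visits every line before it repeats. */
void **build_chain(void)
{
    char *region = aligned_alloc(LINE_BYTES, REGION_BYTES);
    size_t *next = malloc(NUM_LINES * sizeof *next);
    if (!region || !next) {
        free(region);
        free(next);
        return NULL;
    }

    for (size_t i = 0; i < NUM_LINES; i++)
        next[i] = i;
    for (size_t i = NUM_LINES - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* 0 <= j < i */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }

    /* The first 8 bytes of line i hold the address of line next[i]. */
    for (size_t i = 0; i < NUM_LINES; i++)
        *(void **)(region + i * LINE_BYTES) = region + next[i] * LINE_BYTES;

    free(next);
    return (void **)region;
}

/* C equivalent of repeating `mov rsi, [rsi]` iters times. */
void **chase(void **p, size_t iters)
{
    while (iters--)
        p = (void **)*p;
    return p;
}
```

Calling chase(build_chain(), n) walks the ring n steps; timing that call and dividing by n would give a per-load latency figure comparable to the assembly loop, provided the compiler keeps the loop as a plain dependent load chain.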

What if we add a dummy load, also from [rsi], to a register we never subsequently access, like this?



.top:
    mov rax, [rsi]   ; dummy load: rax is never read afterwards
    mov rsi, [rsi]   ; pointer-chasing load

    dec rdi          ; the loop overhead doesn't contribute to the runtime
    jnz .top

My mental performance model is that this will do nothing: the mov rax, [rsi] is just dangling out in nowhere land, is not part of any dependency chain, and there is no contention for any resource, really (the CPU is sitting around doing nothing on just about 11 out of 12 cycles, after all).

My mental model is wrong, however. The runtime is in fact always 19.xx cycles.

If we swap the order of the dummy load and the pointer-chasing load, so that the pointer-chasing load comes first (we have to introduce an additional mov because we don't want to update rsi until the dummy load has occurred, or else we are changing the pattern back to the first example, but this doesn't affect anything):



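.top:
    mov rcx, [rsi]   ; pointer-chasing load now issues first
    mov rax, [rsi]   ; dummy load from the same address
    mov rsi, rcx     ; update rsi only after both loads have been issued

    ...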

The behavior disappears!

What micro-architectural detail causes this behavior?

It's as if the first load that misses L1 for a given line is "special" and receives the value directly from L2 with the minimum latency, while subsequent loads to that line go through a slower path: perhaps waiting for the line from L2 to be filled into L1, then being woken and suffering an additional L1 hit latency. The observed difference between the two cases is 7 cycles (19.xx versus 12.0), which could be 4 cycles of L1 latency, plus an extra cycle to complete the L1 fill, plus a cycle or two to wake the waiting load?
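To make the arithmetic of that guess explicit, here is a tiny C sketch that just adds up the components named above. The breakdown is the post's speculation, not a documented pipeline; the constant names and slow_path_cycles are made up for illustration.

```c
#include <assert.h>

enum {
    L2_LATENCY    = 12,  /* measured by the plain pointer-chasing loop */
    L1_HIT        = 4,   /* guessed: L1 hit latency paid by the second load */
    L1_FILL_EXTRA = 1    /* guessed: extra cycle to complete the L1 fill */
};

/* Hypothesized latency when the critical-path load is the second one
 * to touch the line. wakeup_cycles is the "cycle or two" needed to
 * wake the waiting load. */
int slow_path_cycles(int wakeup_cycles)
{
    return L2_LATENCY + L1_HIT + L1_FILL_EXTRA + wakeup_cycles;
}
```

With a one-cycle wakeup this gives 18 cycles and with a two-cycle wakeup 19, so only the upper end of the guess lines up with the measured 19.xx.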

It's an interesting effect because it means that when you know you'll be getting several hits on a line, you want to organize your code so that if one of the loads is on the critical path, you make sure to perform that one first (although that happens naturally a lot of the time anyway).
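As a rough illustration of that advice in C, consider a linked-list walk where the next pointer and a payload are assumed to share a cache line. The struct layout and the name sum_list are hypothetical. Note that a compiler is free to reorder these loads, so source order only expresses intent; you would need to check the generated assembly to confirm the pointer load really comes first.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *next;   /* critical path: the next iteration needs this */
    long payload;        /* assumed to sit on the same cache line as next */
};

long sum_list(const struct node *n)
{
    long sum = 0;
    while (n != NULL) {
        const struct node *next = n->next;  /* critical-path load first */
        sum += n->payload;                  /* second hit on the same line */
        n = next;
    }
    return sum;
}
```

This mirrors the third assembly variant above: the pointer is loaded into a temporary first, the other load from the line follows, and the chain pointer is updated last.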



