Weird L2 latency effect

By: Travis (travis.downs.delete@this.gmail.com),
Room: Moderated Discussions
Consider the following simple loop:


.top:
    mov rsi, [rsi]  ; pointer chase: each load's address comes from the previous load

    dec rdi         ; the loop overhead doesn't contribute to the runtime
    jnz .top


This is just the bare-bones classic pointer chasing loop. If the pointers have been set up to traverse an L2 sized region, you get pretty much exactly 12.0 cycles per iteration on modern Intel as expected: that's the L2 latency.
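For anyone wanting to reproduce this, the pointer setup could look something like the sketch below. It is only an illustration, not the code used for the numbers above: the 64-byte line size, the 128 KiB region (chosen on the assumption of a 32 KiB L1D and a 256 KiB L2), and the names build_chain and cycle_length are all assumptions. It stores one pointer per cache line and links the lines in a shuffled order, so the chase visits every line once per lap in a pattern that is hard to prefetch.

```c
#include <stddef.h>
#include <stdlib.h>

#define LINE_BYTES   64              /* assumed cache-line size */
#define REGION_BYTES (128 * 1024)    /* assumed: larger than L1D, fits in L2 */
#define NUM_LINES    (REGION_BYTES / LINE_BYTES)

/* Store, in the first pointer-sized slot of each line of buf, the address of
   the next line in a shuffled visit order. Linking the shuffled order end to
   end (with wraparound) yields a single cycle covering every line.
   Returns the starting pointer, or NULL on failure. */
static void **build_chain(char *buf, size_t num_lines, unsigned seed)
{
    if (num_lines == 0) return NULL;
    size_t *order = malloc(num_lines * sizeof *order);
    if (order == NULL) return NULL;
    for (size_t i = 0; i < num_lines; i++) order[i] = i;

    /* Fisher-Yates shuffle of the visit order */
    srand(seed);
    for (size_t i = num_lines - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }

    /* line order[k] points at line order[k + 1]; the last wraps to the first */
    for (size_t k = 0; k < num_lines; k++) {
        void **slot = (void **)(buf + order[k] * LINE_BYTES);
        *slot = buf + order[(k + 1) % num_lines] * LINE_BYTES;
    }

    void **start = (void **)(buf + order[0] * LINE_BYTES);
    free(order);
    return start;
}

/* Follow the chain from start and count the steps needed to get back to it,
   giving up once the count exceeds limit. */
static size_t cycle_length(void **start, size_t limit)
{
    void **p = start;
    size_t steps = 0;
    do {
        p = (void **)*p;
        steps++;
    } while (p != start && steps <= limit);
    return steps;
}
```

A buffer from aligned_alloc(LINE_BYTES, REGION_BYTES) would be passed to build_chain, and the returned pointer is what goes in rsi before entering the loop. cycle_length is just a sanity check that the chain really is one lap of NUM_LINES steps.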

What if we add a dummy load also from [rsi] to a register we never subsequently access like this?


.top:
    mov rax, [rsi]  ; dummy load: same address, result never used
    mov rsi, [rsi]  ; pointer chase (the critical path)

    dec rdi         ; the loop overhead doesn't contribute to the runtime
    jnz .top



My mental performance model is that this will do nothing: the mov rax, [rsi] is just dangling out in nowhere land, is not part of any dependency chain, and there is no contention for any resource, really (the CPU is sitting around doing nothing on just about 11 out of 12 cycles, after all).

My mental model is wrong, however. The runtime is in fact always 19.xx cycles.

If we swap the order of the dummy load and the pointer-chasing load so that the pointer-chasing load comes first, we have to introduce an additional mov, because we don't want to update rsi until the dummy load has occurred (otherwise we would be changing the pattern back to the first example). The extra mov doesn't affect anything:



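.top:
    mov rcx, [rsi]  ; pointer-chasing load now comes first
    mov rax, [rsi]  ; dummy load, same address
    mov rsi, rcx    ; update rsi only after both loads have read it

    ...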


The behavior disappears!

What micro-architectural detail causes this behavior?

It's as if the first load that misses L1 for a given line is "special" and receives the value directly from L2 with the minimum latency, while subsequent loads to that line go through a slower path: perhaps they wait for the line from L2 to be filled into L1, are then woken, and suffer an additional L1 hit latency. The observed difference between the two cases is 7 cycles, which could be 4 cycles of L1 latency, plus an extra cycle to complete the L1 fill, plus a cycle or two to wake the waiting load?
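To make that guess concrete, here is the arithmetic as a tiny C sketch. The 12 and 19 cycle figures are the measurements from this post; the 4-cycle L1 hit, the 1-cycle fill, and the wake-up cost are only the hypothesized components, and the names slow_path_cycles and wake_cycles_needed are made up for illustration.

```c
/* Measured values from this post (the slow case is 19.xx, truncated here) */
enum { L2_LATENCY = 12, OBSERVED_SLOW = 19 };

/* Hypothesized components of the slow path (guesses, not documented figures) */
enum { L1_HIT = 4, L1_FILL_EXTRA = 1 };

/* Cycles for a load that waits for the L1 fill, is woken, then hits in L1. */
static int slow_path_cycles(int wake_cycles)
{
    return L2_LATENCY + L1_FILL_EXTRA + wake_cycles + L1_HIT;
}

/* Wake-up cost that would make the hypothesis match the measurement. */
static int wake_cycles_needed(void)
{
    return OBSERVED_SLOW - L2_LATENCY - L1_FILL_EXTRA - L1_HIT;
}
```

Under these assumptions a 1-cycle wake-up gives 18 cycles and a 2-cycle wake-up gives 19, so the "cycle or two" would have to be two for the breakdown to add up.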

It's an interesting effect because it means that when you know you'll be getting several hits on a line, you want to organize your code so that if one load is on the critical path, you make sure to perform that one first (although that happens naturally a lot of the time anyway).
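In higher-level code this might look like the sketch below: a linked-list walk where the next pointer (the critical path) and a payload field sit on the same line, with the next load written first. The struct node layout and the sum_list name are hypothetical. Note that source order is only a hint: the compiler is free to reorder the two loads and the out-of-order core schedules them itself, so you would need to check the generated assembly and measure.

```c
#include <stddef.h>

/* Hypothetical node layout: next and value share a cache line. */
struct node {
    struct node *next;  /* critical path: needed to find the following node */
    long value;         /* off the critical path */
};

/* Sum the payloads, writing the critical-path load before the payload load. */
static long sum_list(const struct node *n)
{
    long sum = 0;
    while (n != NULL) {
        const struct node *next = n->next;  /* critical-path load first */
        sum += n->value;                    /* second access to the same line */
        n = next;
    }
    return sum;
}
```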