By: Travis (travis.downs.delete@this.gmail.com),
Room: Moderated Discussions
Consider the following simple loop:
This is just the bare-bones classic pointer chasing loop. If the pointers have been set up to traverse an L2 sized region, you get pretty much exactly 12.0 cycles per iteration on modern Intel as expected: that's the L2 latency.
What if we add a dummy load also from [rsi] to a register we never subsequently access like this?
My mental performance model is that this will do nothing: the
My mental model is wrong, however. The runtime is in fact always 19.xx cycles.
If we change the order of the dummy load and the pointer chasing load around, so the pointer chasing load comes first (we have to introduce an additional mov because we don't want to update rsi until the dummy load has occured or else we are changing the pattern back to the first example, but this doesn't affect anything):
The behavior disappears!
What micro-architectural detail causes this behavior?
It's like the first load that misses L1 for a given line is "special" and receives the value directly from L2 with the minimum latency, but that subsequent loads go through a slower path: perhaps waiting for the line from L2 to be filled in L1 and then being woken and suffering an additional L1 hit latency. The observed difference between the two cases is 7 cycles, which could be 4 cycles L1 latency plus a cycle extra to complete the L1 fill, and a cycle or two to wake the waiting load?
It's an interesting effect because it means when you know you'll be getting several hits on a line you want to organize your code so that if one load is on the critical path you make sure perform that one first (although that happens naturally anyways a lot of the time).
.top:
mov rsi, [rsi]
dec rdi ; the loop overhead doesn't contribute to the runtime
jnz .top
This is just the bare-bones classic pointer chasing loop. If the pointers have been set up to traverse an L2 sized region, you get pretty much exactly 12.0 cycles per iteration on modern Intel as expected: that's the L2 latency.
What if we add a dummy load also from [rsi] to a register we never subsequently access like this?
.top:
mov rax, [rsi]
mov rsi, [rsi]
dec rdi ; the loop overhead doesn't contribute to the runtime
jnz .top
My mental performance model is that this will do nothing: the
mov rax, [rsi] is just dangling out in nowhere land, is not part of any dependency chain, and there is no contention for any resource, really (the CPU is sitting around doing nothing on just about 11 out of 12 cycles, after all).My mental model is wrong, however. The runtime is in fact always 19.xx cycles.
If we change the order of the dummy load and the pointer chasing load around, so the pointer chasing load comes first (we have to introduce an additional mov because we don't want to update rsi until the dummy load has occured or else we are changing the pattern back to the first example, but this doesn't affect anything):
.top:
mov rcx, [rsi]
mov rax, [rsi]
mov rsi, rcx
...
The behavior disappears!
What micro-architectural detail causes this behavior?
It's like the first load that misses L1 for a given line is "special" and receives the value directly from L2 with the minimum latency, but that subsequent loads go through a slower path: perhaps waiting for the line from L2 to be filled in L1 and then being woken and suffering an additional L1 hit latency. The observed difference between the two cases is 7 cycles, which could be 4 cycles L1 latency plus a cycle extra to complete the L1 fill, and a cycle or two to wake the waiting load?
It's an interesting effect because it means when you know you'll be getting several hits on a line you want to organize your code so that if one load is on the critical path you make sure perform that one first (although that happens naturally anyways a lot of the time).



