uop replay after L1 miss

By: Travis Downs (travis.downs.delete@this.gmail.com), September 30, 2018 6:25 pm
Room: Moderated Discussions
Somewhat tangentially related to recent discussions on pointer-chasing latency with and without ALU ops in the chain, I thought I'd share some recent findings on uop replay on modern Intel (Skylake client, specifically).

The presence of replay is detected through performance counters, specifically the "uops_dispatched_port" counters. I think they are accurate and the results generally make sense, but as with anything involving performance counters, there is always the chance that what they show is just a mirage: the counters might simply not be counting accurately.

I found that uops dependent on a load which misses in L1 [1] but hits in L2 are replayed. Perhaps more interestingly, I found that more than one uop may be replayed. This is perhaps obvious in the case that several ops depend directly on the load result, in which case all of them may have tried to execute and need to be replayed, like so:
.top:
mov rsi, [rsi]
add r8d, esi
add r9d, esi
add r10d, esi
; ...
jcc .top
In this case all of the additions may try to run when the load is expected to return (5 cycles later, assuming an L1 hit), and all will need to be replayed when that doesn't happen.

What is interesting is that even ops that don't directly depend on the load result, but depend on an ALU op that itself depends on the load, may also be replayed. For example:
.top:
mov rsi, [rsi]
add r8d, esi
add r9d, r8d
; ...
jcc .top
Here, the add r9d, r8d depends on the earlier addition rather than directly on the load, but it is also replayed. The effect seems to extend to operations that would have started up to about 3 cycles after the load was expected to return. For the first couple of cycles it is exact, but right around 3 cycles you sometimes see a uop replayed only half the time (e.g., a chain of 4 single-cycle uops depending on a load might show 3.5 replayed uops).

I guess the replay detection and implementation has some latency and in the meantime these extra dependent ops execute and have to be replayed.

If the load also misses in L2 but hits in L3, you see a doubling of the replayed uops, so I guess the same approach is used for anticipated L2 hits. For L3 misses there is no similar effect, which makes sense since the scheduler would probably not try to predict the variable L3 latency.

What can you do with this information? Not that much. Perhaps if you expect L1 misses and L2 hits you might try to organize execution so that you get fewer replays, but this is hard. It probably gives a small boost to software prefetching, since with SW PF you can more easily arrange for lines to be L1 hits: in a sense you pay the uop cost of the SW PF instructions in exchange for avoiding the uop cost of replays.

Perhaps this information is useful if you see overcounting on the "dispatched to port X" counters, even in scenarios without much bad speculation: these replays associated with misses are another source of extra events.

[1] This also covers the case where the load actually hits in L1, but was mis-predicted to take the 4-cycle fast path and that didn't pan out because the base register pointed to a different page than base + offset.