4-cycle L1 latency on Intel not as general as though

By: anon (spam.delete.delete@this.this.spam.com), September 18, 2018 2:43 am
Room: Moderated Discussions
Travis Downs (travis.downs.delete@this.gmail.com) on September 17, 2018 4:32 pm wrote:
> Perhaps the title should say "not as general as I thought" - but anyways...
>
> Until recently, I had believed that you get 4-cycle L1 loads on Intel as long as you have "simple addressing"
> (base + offset, where offset is in [0, 2047]). Later on I realized that you can actually get 4-cycle latency
> even with "indexed" addressing modes, but only if the index component is zero at runtime and was set to zero
> with a zeroing idiom. That doesn't change much since that pattern is probably unlikely in real code.
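For reference, the cases being described look something like this (register choices are mine):

mov rdx, [rax + 128]    ; simple addressing: base + small offset, can be 4 cycles
mov rdx, [rax + rcx]    ; indexed addressing: normally 5 cycles
xor ecx, ecx            ; zeroing idiom, so rcx is a known zero
mov rdx, [rax + rcx]    ; reportedly still 4 cycles, since the index was zeroed that way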
>
>
> More recently, I read and confirmed that you only get the 4-cycle loads if the source of the base
> register was an earlier load, and not something else like an ALU op. For example:
mov rax, [...]
mov rdx, [rax + 128]    ; this load has 4 cycle latency because the address reg
                        ; rax came from a load (in this case, the prior instruction)

lea rax, [...]
mov rdx, [rax + 128]    ; this load, despite being identical to the above,
                        ; takes 5 cycles because here rax was set by a non-load
                        ; instead (lea, but it could be any ALU op really)

> So that's already weird - the fast path only works when a load result is being fed directly back into
> another load, but not when the load address is the result of some other ALU operation.
>

Can it do 2 fast path loads in the same cycle? If not, it would make sense to prioritize pointer chases.
It could also be bypass delay. Not sure what the likelihood of a store to a calculated address being reloaded immediately is, but I can't imagine it would be very high. Similarly, for anything but a pointer chase, the extra cycle for a load shouldn't matter when you've got other dependency chains that can execute in parallel. A test along these lines should answer the first question:
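This is just a sketch; the setup with self-pointing qword chains is my assumption about how you'd measure it:

; several independent pointer chases, each qword pointing at itself;
; enough chains to keep both load ports busy every cycle
top:
mov rax, [rax]      ; each load feeds the next in its own chain,
mov rbx, [rbx]      ; so each is eligible for the 4-cycle path
mov rsi, [rsi]
mov rdi, [rdi]
mov r8,  [r8]
mov r9,  [r9]
mov r10, [r10]
mov r11, [r11]
dec ecx
jnz top
; if this sustains 2 loads/cycle with each chain at 4-cycle latency,
; both load ports can take the fast path in the same cycle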

> Today though, I ran across this StackOverflow post, where the subsequent
> investigation showed that 4-cycle latency loads are even more special!
>
> In particular, when you do a load like mov rax, [rax + 128] you only get the result in 4 cycles if the
> earlier condition about load-feeding-load is met and both rax and rax + 128 point into the same page!
> That is, if rax is pointing at the last 128 bytes of a page, then the actual address after the +128 offset
> will be in the next page, and you don't get a fast load. In fact, you don't even just get a slow 5 cycle
> load: the load is replayed and ends up at 9 cycles total (Haswell) or 10 cycles total (Skylake).
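Concretely, the failing case looks something like this (the page-offset setup is my assumption):

; rax holds an address in the last 128 bytes of a 4K page, e.g. page offset 0xf80
mov rax, [rax + 128]    ; rax + 128 lands in the next page: the speculation fails,
                        ; the load replays, and you see ~9 cycles (Haswell)
                        ; or ~10 cycles (Skylake) instead of 4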
>
> My idea of how this works is the same as Peter mentions in his answer: in this fast case the TLB
> lookup starts in parallel with the actual address calculation, based solely on the base register:
> this apparently allows shaving one cycle off of the TLB -> tag check critical path, and the lack
> of index register presumably allows shaving at least a cycle off of the address calculation path.
>
> The downside is if the speculation is wrong you pay a not insubstantial penalty. This penalty may
> be why this 4-cycle optimization is only applied when a load feeds a load: perhaps this is trying
> to detect pointer chasing loads and applying the 4-cycle fast path only there where the benefits
> are likely to overcome the penalty costs. Or perhaps there is just something special about the load
> EUs that make it possible to consume the load more quickly if it was a result of another load.
>

Like Peter mentioned, it also only attempts this for displacements less than 2048, so they probably did the math that this would get them a speedup on average.
If you assume the worst case of a 2047 displacement and random addresses, half the accesses would fall within the same page, so at twice the latency for a replay it would be neutral. With the average displacement between 0 and 2047 usually being far lower than 2047, it should easily be a win.
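The quick math, assuming 4 KiB pages and a uniformly random page offset for the base: a displacement d stays within the same page with probability (4096 - d)/4096. At d = 2047 that's 2049/4096, just over 50%, while at a more typical d = 128 it's 3968/4096, about 97%.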

> I checked and AMD Ryzen doesn't have either of these weird behaviors: simple loads
> are 4 cycles even if the address comes from an ALU operation, and they are still
> 4 cycles when the base register and full address fall in different pages.
>
> The penalty on Skylake is 10 cycles, but in a tight loop made to trigger the issue on every load you get only 7.5
> cycles average: it seems that when the speculation failure and replay happens, the next load will be a normal
> 5-cycle load, but the one after that will speculate again and try to execute in 4 cycles, so you end up with
> alternating latencies of 10,5,10,5 which average out to 7.5. So the worst case is less bad in Skylake.

All in one dependency chain? Again, Peter wrote that Skylake won't attempt the fast path if it failed for the previous load. There are probably statistics behind this choice as well.