4-cycle L1 latency on Intel not as general as thought

By: Travis Downs (travis.downs.delete@this.gmail.com), September 18, 2018 11:07 am
Room: Moderated Discussions
anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 10:53 am wrote:
>
> I meant that in all other scenarios, including stores, the latency should matter much less.

Well yes, for stores "latency" matters very little - to the point where it is hard to even define or test the "latency" for stores.

The best you can do is measure the store-forwarding latency, which on Skylake at least is, in the best case, even lower than an L1 hit, at 3 cycles: probably because you can take the address calculation and TLB lookup out of the fast path entirely: you just assume the load hits the store based on the low address bits, and take a disambiguation speculation failure and flush if you are wrong.
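
To make that concrete, here's roughly the kind of loop you end up timing (a minimal sketch of my own, assuming x86-64 with gcc/clang, and treating TSC ticks as close enough to core cycles over a long run; the +1 in the chain adds a cycle, so what you measure is about the forwarding latency plus one):

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    int main(void) {
        const uint64_t iters = 100000000ULL;
        volatile uint64_t slot = 0;  /* volatile keeps the store/load pair in the loop */
        uint64_t x = 1;

        uint64_t t0 = __rdtsc();
        for (uint64_t i = 0; i < iters; i++) {
            slot = x;       /* store ... */
            x = slot + 1;   /* ... immediately loaded back: the load has to forward
                               from the in-flight store, so every iteration is
                               serialized on the store -> load -> add chain */
        }
        uint64_t t1 = __rdtsc();

        printf("x = %llu, ~%.2f TSC ticks per store->load->add iteration\n",
               (unsigned long long)x, (double)(t1 - t0) / (double)iters);
        return 0;
    }

In practice you'd pin the frequency, warm up, and account for the add, but the shape is the point: the loop is serialized entirely by the store -> load dependence, which is the closest thing to a "store latency" you can actually observe.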

> Everything is complex once you go into details. We can only work with simplifications here.

Sure, yes - but we should acknowledge the complexity, or at least note when simplifications are being made. So rather than saying "load latency doesn't matter except in pointer-chasing cases" we could say something like "load latency matters less in non-pointer-chasing scenarios, but how much less depends on the code".
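
To illustrate the "depends on the code" part, compare these two toy loops (my own illustration, no timing, just the dependence structure): the list walk can't issue the next load until the previous one returns, so it runs at roughly one iteration per load latency, while the array sum has fully independent loads, so the out-of-order core overlaps them (up to the two loads per cycle discussed elsewhere in this thread) and a 4-vs-5 cycle latency difference mostly washes out:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Latency-bound: each load's address comes out of the previous load,
     * so the loads form a serial chain. */
    struct node { struct node *next; uint64_t val; };

    uint64_t sum_list(struct node *head) {
        uint64_t s = 0;
        for (struct node *n = head; n != NULL; n = n->next)
            s += n->val;
        return s;
    }

    /* Throughput-bound: the loads are independent of each other, so the
     * core can have many of them in flight at once. */
    uint64_t sum_array(const uint64_t *a, size_t n) {
        uint64_t s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    int main(void) {
        enum { N = 1 << 16 };
        static uint64_t arr[N];
        static struct node nodes[N];
        for (size_t i = 0; i < N; i++) {
            arr[i] = i;
            nodes[i].val = i;
            nodes[i].next = (i + 1 < N) ? &nodes[i + 1] : NULL;
        }
        printf("list sum  = %llu\n", (unsigned long long)sum_list(&nodes[0]));
        printf("array sum = %llu\n", (unsigned long long)sum_array(arr, N));
        return 0;
    }

(Here the nodes happen to be laid out contiguously, which a real latency benchmark would deliberately avoid; the only point is the address dependence in sum_list versus its absence in sum_array.)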

>
> > That makes the choice of 2048 interesting, compared to say 1024. For offsets between 1024
> > and 2048, using the "uniformly random accesses" assumption you wouldn't expect the fast path
> > to pay off, on average. I have no doubt the number was carefully chosen however, probably
> > through simulation - so there might be some hidden effect that makes page crossing less common
> > than you'd expect - i.e., the "uniformly random" assumption may be wrong somehow.
> >
>
> Correct, if I'm not mistaken for uniformly random the chance of page crossings would
> be ~37%. If the displacements or memory positions are even slightly biased towards lower
> numbers you can easily get it below the 20%/~16.6% needed to make it worthwhile, especially
> if you halve the penalty on most cases where it does happen like SKL does.

Note that the penalty is not "halved" on Skylake except in the pathological case where every load has the different-page behavior. In the "usual" case where only an occasional load behaves that way, the new behavior on Skylake slightly increases the penalty: you get a 6-cycle penalty (10 vs 4) for the naughty load, plus an additional ~1-cycle penalty on the next load, which takes the 5-cycle path rather than the 4-cycle path even though it probably could have taken the 4-cycle path.

So for a case like the one we're discussing, with 20% naughty loads or whatever, we can't say the penalty is halved on Skylake (it is probably slightly worse than Haswell).

ISTM the layout would have to be quite biased to get down to the crossover point.
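
Here's the back-of-the-envelope version of that argument (my own plugging-in of the numbers from this thread, plus two assumptions of mine: that the failing load also costs about 10 cycles on Haswell, and that naughty loads are rare enough that two never land back to back):

    #include <stdio.h>

    int main(void) {
        const double fast = 4.0;   /* speculative fast path                              */
        const double slow = 5.0;   /* non-speculative path forced after a failure (SKL)  */
        const double fail = 10.0;  /* cost of a load landing in a different page than its
                                      base register (assumed ~equal on HSW and SKL here) */

        for (int i = 0; i <= 6; i++) {
            double p = 0.05 * i;   /* fraction of "naughty" loads in the chain */

            /* Haswell-style: every load speculates; a fraction p of them fail. */
            double hsw = (1.0 - p) * fast + p * fail;

            /* Skylake-style: the same fraction p fail, and each failure also pushes
               the following (otherwise fine) load onto the 5-cycle path. */
            double skl = (1.0 - 2.0 * p) * fast + p * fail + p * slow;

            printf("p = %.2f: Haswell ~%.2f, Skylake ~%.2f cycles/load\n", p, hsw, skl);
        }
        return 0;
    }

With those assumptions Skylake trails Haswell by roughly p extra cycles per load (the forced 5-cycle follow-up), and the "halving" only shows up once nearly every load in the chain misbehaves, where this simple non-adjacent model stops applying anyway.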



> Oh I wasn't talking about anything complicated like that. After all just checking if the high bits
> are 0 was chosen because it's easy to implement, not because 2047 was the exact cutoff point. So
> just statistics as in "is adding an extra cycle to block once after a failure a net win" or if the
> added cycle exists for other reasons "is the cheapest method enough to fix the regression".

Agreed.

About 2048 vs other values: one would expect they would at least have tried all the easy-to-check power-of-two cutoffs, e.g., via simulation.