4-cycle L1 latency on Intel not as general as thought

By: anon (spam.delete.delete@this.this.spam.com), September 18, 2018 11:51 am
Room: Moderated Discussions
Travis Downs (travis.downs.delete@this.gmail.com) on September 18, 2018 11:07 am wrote:
> anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 10:53 am wrote:
> >
> > I meant that in all other scenarios, including stores, the latency should matter much less.
>
> Well yes, for stores "latency" matters very little - to the point
> where it is hard to even define or test the "latency" for stores.
>

This was about the subset where you've got an ALU op before the store. Like I said, I can't imagine many cases where you'd need an address calculation so complex that the normal addressing modes don't cut it, and then hit the same address immediately afterwards such that the bypass latency from ALU to AGU would matter. Of course it isn't actually a bypass delay, but that was my idea at the time.
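
To be concrete, the kind of code I have in mind would be something like this contrived C sketch (my own example, not anything from the measurements):

/* Contrived illustration: the index needs real ALU work (a multiply by a
 * non-power-of-two plus an XOR), so it can't be folded into a base+index*scale
 * addressing mode. The store's address then waits on the ALU result, and the
 * load right after hits the same address. Names and constants are made up. */
#include <stdint.h>

uint32_t scatter(uint32_t *buf, uint32_t i, uint32_t x)
{
    uint32_t idx = (i * 5u) ^ 7u;   /* ALU op the addressing modes can't express */
    buf[idx] = x;                   /* store whose address depends on that ALU result */
    return buf[idx];                /* immediately hit the same address again */
}

You have to go out of your way to write something like that, which is exactly why I don't think it matters much.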

> The best you can do is measure the store-forwarding latency, which on Skylake at least is even lower,
> in the best case, than an L1 hit at 3 cycles: probably because you can take the address calculation
> and TLB lookup out of the fast path entirely: you just assume your load hit the store based on the
> low address bits and take a disambiguation speculation failure and flush if you are wrong.
>
> > Everything is complex once you go into details. We can only work with simplifications here.
>
> Sure, yes - but we should acknowledge the complexity, or at least when simplifications are being made. So rather
> than saying "load latency doesn't matter except for pointer chasing cases" we could say something like
> "load latency matters less in non-pointer chasing scenarios, but how much depends on the code".
>

That's why I wrote "Similarly for anything but a pointer chase the extra cycle for a load shouldn't matter when you've got other dependency chains that can execute in parallel." and not "latency doesn't matter".
It's literally "shouldn't matter for certain code".

> >
> > > That makes the choice of 2048 interesting, compared to say 1024. For offsets between 1024
> > > and 2048, using the "uniformly random accesses" assumption you wouldn't expect the fast path
> > > to pay off, on average. I have no doubt the number was carefully chosen however, probably
> > > through simulation - so there might be some hidden effect that makes page crossing less common
> > > than you'd expect - i.e., the "uniformly random" assumption may be wrong somehow.
> > >
> >
> > Correct, if I'm not mistaken for uniformly random the chance of page crossings would
> > be ~37%. If the displacements or memory positions are even slightly biased towards lower
> > numbers you can easily get it below the 20%/~16.6% needed to make it worthwhile, especially
> > if you halve the penalty on most cases where it does happen like SKL does.
>
> Note that the penalty is not "halved" in Skylake except in the pathological case where every load
> has the different page behavior. In the "usual" case where only an occasional load has this behavior,
> the new behavior on Skylake slightly increases the penalty: you get a 6 cycle penalty (10 vs 4) for
> the naughty load, and an additional ~1 cycle penalty on the next load which takes the 5-cycle path
> rather than the 4-cycle path even though it probably could have taken the 4-cycle path.
>
> So for this case where we talk about 20% naughty loads or whatever, we can't
> say the penalty is halved on Skylake (it is probably worse than Haswell).
>
> ISTM the layout would have to be quite biased to get down to the crossover point.
>

Please don't be condescending just because you don't understand the argument.

Low displacements will rarely cross pages, so the penalty isn't really a concern there.
The pathological case and the 1024-2047 displacements are what you should be worried about.
Most failures won't come from randomly distributed and/or low displacements. They'll come from being near the end of a page, or from multiple high displacements in the last half or quarter of a page, and the blocking does halve the penalty for those.
You agreed that in those cases the penalty is halved, so if those account for most of the failures it's a clear win.

Of course you can assume instead that someone at Intel decided to put in extra circuitry just for fun even though it's a loss, but that's not the assumption I'm working with.
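
To make the argument concrete, here's a rough C model I threw together. All the costs and the exact blocking behaviour are assumptions spelled out in the comments; the only point is the direction of the result: with the same overall crossing rate, blocking loses a little when the crossings are scattered and wins when they come in runs.

/* Rough back-of-the-envelope model, NOT a description of the real hardware.
 * Assumed costs: 4 cycles for a fast-path hit and 10 cycles for a load that
 * takes the fast path but crosses a page (both figures quoted above); 5 cycles
 * for the 5-cycle path; 9 cycles for a hypothetical flat-penalty design with
 * no blocking (my assumption, consistent with the 20% break-even above).
 * "Blocking" here forces only the next load onto the 5-cycle path, where a
 * crossing is assumed to cost nothing extra - both details are my guesses
 * about the mechanism, not something stated in the thread. */
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

static double avg_cycles(const int *crosses, int blocking)
{
    long long cycles = 0;
    int blocked = 0;                      /* is the next load forced onto the 5-cycle path? */
    for (int i = 0; i < N; i++) {
        if (blocking && blocked) {
            cycles += 5;                  /* 5-cycle path: assumed no extra cost even if it crosses */
            blocked = 0;
        } else if (crosses[i]) {
            cycles += blocking ? 10 : 9;  /* fast-path attempt that crossed a page */
            if (blocking) blocked = 1;
        } else {
            cycles += 4;                  /* fast-path hit */
        }
    }
    return (double)cycles / N;
}

int main(void)
{
    static int independent[N], clustered[N];

    srand(1);
    for (int i = 0; i < N; i++) {
        independent[i] = (rand() / (RAND_MAX + 1.0)) < 0.2;  /* ~20% crossings, scattered */
        clustered[i]   = (i % 100) < 20;                     /* 20% crossings, in runs */
    }

    printf("scattered crossings: flat %.3f cycles/load, blocking %.3f cycles/load\n",
           avg_cycles(independent, 0), avg_cycles(independent, 1));
    printf("clustered crossings: flat %.3f cycles/load, blocking %.3f cycles/load\n",
           avg_cycles(clustered, 0), avg_cycles(clustered, 1));
    return 0;
}

With those assumptions it prints roughly 5.0 vs 5.2 cycles/load when the crossings are scattered and 5.0 vs 4.7 when they're clustered, i.e. the blocking loses a little in your "usual" case and wins in the case I'm describing.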

>
>
> > Oh I wasn't talking about anything complicated like that. After all just checking if the high bits
> > are 0 was chosen because it's easy to implement, not because 2047 was the exact cutoff point. So
> > just statistics as in "is adding an extra cycle to block once after a failure a net win" or if the
> > added cycle exists for other reasons "is the cheapest method enough to fix the regression".
>
> Agreed.
>
> About 2048 vs other values, one would expect they would still try
> all the easy-to-check power-of-two values, e.g., via simulation.

For 512-1023 it's ~19% even if it were uniform, and for 2048-4095 it's ~75%, so it was always going to be either 1024 or 2048.
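
As a sanity check on those percentages (my own arithmetic, nothing more: assume 4 KiB pages and a uniformly random page offset, so a displacement d crosses into the next page with probability d/4096, then average over each range):

/* Average page-crossing probability per displacement range, assuming 4 KiB
 * pages and a uniformly random page offset: P(cross | d) = d/4096. */
#include <stdio.h>

static double avg_cross(int lo, int hi)       /* inclusive displacement range */
{
    double sum = 0;
    for (int d = lo; d <= hi; d++)
        sum += d / 4096.0;
    return 100.0 * sum / (hi - lo + 1);
}

int main(void)
{
    printf("512-1023:  %.1f%%\n", avg_cross(512, 1023));    /* ~18.7% */
    printf("1024-2047: %.1f%%\n", avg_cross(1024, 2047));   /* ~37.5% */
    printf("2048-4095: %.1f%%\n", avg_cross(2048, 4095));   /* ~75.0% */
    return 0;
}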