4-cycle L1 latency on Intel not as general as thought

By: anon (spam.delete.delete@this.this.spam.com), September 20, 2018 8:40 am
Room: Moderated Discussions
Travis Downs (travis.downs.delete@this.gmail.com) on September 19, 2018 5:20 pm wrote:
> anon (spam.delete.delete@this.this.spam.com) on September 19, 2018 2:40 am wrote:
>
> > You immediately default to explaining the same thing over and over again. If you asked for clarification
> > or told me I suck at explaining my argument it'd be fine,
> > I'd just have to post a more detailed explanation,
> but this forces me to read through the same thing over and over again, each time longer and "easier to
> > understand" than the last and it's somewhere between annoying and insulting.
>
> Yes, I wasn't understanding your math so rather than demanding a better explanation I tried to explain my
> side more clearly, because maybe you weren't getting it. In the future, if it helps, feel free to treat
> repeat explanations on my part as code for "your explanations suck". It would be easier for me to write less
> and just rely on the other person to elaborate their explanation. If you also prefer that, perfect!
>
> >
> > Correct, which means that reality must be quite a bit different than random displacements.
>
> Yes, but based on my calculation, even if displacements were clustered in the most skewed possible
> way, i.e., every displacement in the range [1024, 2047] was actually exactly 1024, it still wouldn't
> pay off. So even with the most extreme assumption you need something else: like assuming pointers
> are more likely to point into the first part of a page (this is likely in fact, especially for processes
> which don't allocate much - but I don't know how strong the effect is).
>

That's what I was missing. I just couldn't figure out why you thought the base addresses would have to be in the first half of the page.
First of all, it would be interesting to know whether the check is actually for [0, 2047] or [-2048, 2047].
Secondly, I should have specified that it's not just about the displacements themselves but about the combination of base and displacement. See the unrolled loop case we spoke about in other posts: the displacements might very well be random, but all the eventual target addresses could still fall within a single page.
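
To put rough numbers on that, here's a quick Python sketch (my own toy model, using the uniform base-offset assumption we've both been questioning):

# Page-crossing odds for base + displacement loads with 4 KiB pages.
# Assumes the base's offset within its page is uniformly distributed.
import random

PAGE = 4096

def crosses(base_offset, disp):
    # True if base + disp lands in a different 4 KiB page than the base.
    return (base_offset + disp) >= PAGE

def cross_rate(disp, trials=100_000):
    return sum(crosses(random.randrange(PAGE), disp) for _ in range(trials)) / trials

# A displacement of 1 crosses for only 1 in 4096 base offsets,
# while a displacement of 4096 crosses for every base offset.
for d in (1, 1024, 2047, 4096):
    print(f"disp={d:4d}: ~{cross_rate(d):.4f} (analytic {min(d, PAGE) / PAGE:.4f})")

# Unrolled-loop case: several fixed displacements off one base pointer.
# If the base happens to sit early enough in its page, none of them cross,
# even though the displacements themselves span a wide range.
base_offset = 100                    # arbitrary example offset
disps = [8, 520, 1032, 1544, 2056]   # hypothetical unrolled accesses
print([crosses(base_offset, d) for d in disps])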

> The displacement thing would be easy to check, the pointer value one not so much.
>
> >
> > > Even if you assume lower offsets are much more common, it doesn't make 2048 better: you have to assume also
> > > that pointers tend to be more often in the first half of the page, and that's what I find fairly unlikely.
> > >
> >
> > That's wrong. 4096 displacement will always lead to a page crossing whereas
> > a displacement of 1 can only cause a page crossing for 1 in 4096 pointers.
>
> What's wrong? You are doing a bad job of explaining what claim I made and why it is wrong. Your comment doesn't
> seem very related (to me) to the part you quoted. Of course a 4096 displacement always leads to a page crossing.
>

See above.

> >
> > > > Most failures won't happen on the randomly distributed and/or low displacements. When you're
> > > > near the end of a page or have multiple high displacements in the last half or quarter that's
> > > > where most of them will come from and the blocking does halve the penalty for those.
> > > > You agreed that in those cases the penalty is halved so if those are most then it's a clear win.
> > >
> > > The new Skylake behavior helps in the worst case when you have many back-to-back different page
> > > addresses, i.e., they are bursty/highly correlated. It doesn't help if the offsets are large and
> > > the address distributions are more or less random, so the different page case is more or less
> > > randomly distributed, unless the probability is very high (i.e., approaches worst case).
> > >
> > > If you run some numbers you should see that with the "uniform distribution non-bursty" assumption the new
> > > behavior cuts down on the penalty for high offsets, but not anywhere close to half.
> >
> > Like I said I think that most of the penalties will happen in bursty
> > cases, but feel free to show some numbers that say otherwise.
> > If you want to argue against a strawman "in the exact opposite
> > case that's not true" feel free to do so as well.
>
> I don't know. It's definitely not exactly the opposite though (no correlation).
>
> My initial thought was that most crossings would come from more or less random cases,
> especially with short chases which are common in programs (think a couple of dereferences
> in a chain, not "traversing a big linked list"). That is, these short cases eliminate
> really bursty behavior by design since there is no big chain in the first place.
>
> Bursty seems very likely for certain structures with small nodes, so I think it is common as well.
>
> One important thing about bursty cases is that the problem could be very acute: like it actually makes a
> noticeable regression in benchmark X. So if you don't fix bursty, you might get an actual regression
> SKL vs BDW for example. The other crossings are just dispersed everywhere so it's hard to see them
> causing a regression if you have some small IPC uplift elsewhere to cancel it out.
>

I guess that clears that up.

>
>
> > I'm not sure what your problem is. You're simultaneously arguing against highly correlated
> > cases being common, which would mean that the SKL behaviour is a loss, while arguing that it
> > makes sense because it's "an obvious big win in worst case and highly correlated cases".
>
> I don't find it weird. I'm not trying to score points or evaluate good or bad. The claim that
> "X is not used useful in case Y but really useful case Z" seems entirely consistent.
>
> I'm not sure I ever said highly correlated cases aren't common, but maybe I said
> "pathological" (which is different, but may imply uncommon) and called the other
> case "usual" once. Other than that I think put in all the correct "ifs".
>
> For the record, my view about correlated cases is immediately above. I did change my position a bit
> after I thought about how node-based structures are often allocated, so I definitely think they are
> more common than I originally thought. I also wouldn't be surprised if your claim that most crossings appear
> in correlated cases is true, but it would be hard to agree on a representative set of programs.
>

Yeah, the wording combined with a slightly changing position probably just made me misunderstand that.

The thing is, compilers are incredibly good at paying attention to these tiny details, so I'd expect most cases to happen either when the compiler knows about the issue but the tradeoff for a different addressing mode (possibly even due to encoding or other tiny details) or an extra base register isn't worth it, or when something has gone horribly wrong.

> I realized actually testing this is easier than I thought: `perf mem` will do it: you get
> a list of access addresses and IPs: you could parse that and the disassembly and actually
> get a metric here more easily than messing around with pintool or valgrind or whatever.
>
> > Make up your mind. It can't be a big win and at the same time not be a big win.
>
> It's not confusing - it's a big win in the bursty cases. Most likely that's enough
> to make it a small win overall, which is why we see it (or it's a weird bug).
>
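
The perf mem route you mention above would be easy to script. Something like the following (the exact perf script output format varies between versions, so the parsing is a guess, and ./your_workload is just a placeholder):

# Record data-access samples and dump instruction pointer + data address:
#   perf mem record -- ./your_workload
#   perf script -F ip,addr > samples.txt
# Then tally, per load IP, where the sampled data addresses sit within their
# 4 KiB pages. The column layout below is an assumption; adjust as needed.
import sys
from collections import defaultdict

PAGE = 4096
offsets_by_ip = defaultdict(list)

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 2:
        continue
    try:
        ip, addr = int(fields[0], 16), int(fields[1], 16)
    except ValueError:
        continue
    offsets_by_ip[ip].append(addr % PAGE)

for ip, offs in sorted(offsets_by_ip.items()):
    frac_high = sum(o >= PAGE // 2 for o in offs) / len(offs)
    print(f"{ip:#x}: {len(offs)} samples, {frac_high:.0%} in the upper half of a page")

Cross-referencing those IPs against the disassembly for the displacement sizes would be the other half of the metric.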

I haven't done the exact numbers for the non-bursty cases, but at first glance it seems to be a slight improvement.
The problem is that it's still a loss overall. It wouldn't make sense to improve the performance of the [1024, 2047] interval if it's still a net loss when they could just remove all those cases entirely by stopping at 1023.
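
Here's a rough sanity check of that, again under the uniform assumption. The latencies are assumptions on my part: 4-cycle fast path, 5-cycle fallback, and some number of extra cycles for a replay. The second row shows that even a fully halved replay penalty would only get the [1024, 2047] interval to roughly break even, and you argued above that the non-bursty reduction is nowhere near half.

# Rough cost model for the [1024, 2047] question, uniform base offsets.
# Latency numbers are assumptions: 4-cycle fast path, 5-cycle fallback,
# and 'replay_extra' additional cycles when the fast path has to replay.
PAGE = 4096
FAST, SLOW = 4, 5

def expected_cost(disp, replay_extra):
    # Expected latency if the load speculates on the 4-cycle path.
    p_cross = min(disp, PAGE) / PAGE
    return FAST + p_cross * replay_extra

def breakeven(replay_extra):
    # Largest displacement where speculating still beats the 5-cycle path.
    return (SLOW - FAST) * PAGE / replay_extra

for extra in (5.0, 2.5):   # full replay penalty vs. a hypothetically halved one
    avg = sum(expected_cost(d, extra) for d in range(1024, 2048)) / 1024
    print(f"extra={extra}: break-even disp ~{breakeven(extra):.0f}, "
          f"avg over [1024, 2047] = {avg:.3f} cycles vs {SLOW} without speculation")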

The whole argument just becomes weird when your conclusion is that the fast-path blocking lessens the negative performance impact of high displacements. You'd have to be working from the assumption that high displacements are included despite lowering performance.

> >
> > > > > About 2048 vs other values, one would expect they would still try
> > > > > all the easy-to-check power-of-two values, e.g., via simulation.
> > > >
> > > > For 512-1023 it's ~19% even if it were uniform and for 2048-4095
> > > > it's ~75% so it was always going to be either 1024 or 2048.
> > >
> > > Well sure, if the various biasing were stronger or weaker, or the Skylake-like mitigations
> > > changed the calculations, you could easily get a smaller number or 4096.
> > >
> > > If you just assume totally uniform, then 512 seems to maybe come out on top in a Haswell-like chip.
> >
> > No, the optimum is ~820, 1024 is slightly better than 512.
>
> If nothing else, at least our math seems to agree!
>
> I said "maybe comes out on top" because the difference between the two was so small at point (something like
> less than 0.1%) and I assigned some value to the execution cost of the replay as well (extra uop, possibly
> waking the dependent instruction unnecessarily), beyond the latency penalty and came up with "maybe".

That makes sense.
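
For reference, the figures we've been quoting all drop out of the same uniform model (the exact replay cost is my assumption):

# Checking the quoted figures under the uniform assumptions.
PAGE = 4096
mean = lambda lo, hi: (lo + hi) / 2

print(mean(512, 1023) / PAGE)    # ~0.19 -> the "~19%" crossing rate for [512, 1023]
print(mean(2048, 4095) / PAGE)   # ~0.75 -> the "~75%" for [2048, 4095]

# If the fast path saves 1 cycle (4 vs 5) and a replay costs ~5 extra cycles,
# speculating only pays while d/4096 * 5 < 1, i.e. d < 4096/5 = 819.2,
# which lines up with the "~820" optimum.
print(PAGE / 5)                  # 819.2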