4-cycle L1 latency on Intel not as general as thought

By: Travis Downs (travis.downs.delete@this.gmail.com), September 19, 2018 5:20 pm
Room: Moderated Discussions
anon (spam.delete.delete@this.this.spam.com) on September 19, 2018 2:40 am wrote:

> You immediately default to explaining the same thing over and over again. If you asked for clarification
> or told me I suck at explaining my argument it'd be fine, I'd just have to post a more detailed explanation,
> but this forces me to read throw the same thing over and over again each time longer and "easier to
> understand" than the last and it's somewhere between annoying and insulting.

Yes, I wasn't understanding your math so rather than demanding a better explanation I tried to explain my side more clearly, because maybe you weren't getting it. In the future, if it helps, feel free to treat repeat explanations on my part as code for "your explanations suck". It would be easier to me to write less and just rely on the other person to elaborate their explanation. If you also prefer that, perfect!

>
> Correct, which means that reality must be quite a bit different than random displacements.

Yes, but based on my calculation, even if displacements were clustered in the most skewed possible way, i.e,. every displacement in the range [1024, 2047] was actually exactly 1024 it still wouldn't pay off. So even with the most extreme assumption you need something else: like assuming pointers are more likely to point into the first part of a page (this likely in fact, especially for processes which don't allocate much - but I don't know how strong the effect is).

The displacement thing would be easy to check, the pointer value one not so much.

>
> > Even if you assume lower offsets are much more common, it doesn't make 2048 better: you have to assume also
> > that pointers tend to be more often in the first half of the page, and that's what I find fairly unlikely.
> >
>
> That's wrong. 4096 displacement will always lead to a page crossing whereas
> a displacement of 1 can only case a page crossing for 1 in 4096 pointers.

What's wrong? You are doing a bad job of explaining what claim I made and why it is wrong. Your comment doens't seem very related (to me) to the part you quoted. Of course a 4096 always leads to a page crossing.

>
> > > Most failures won't happen on the randomly distributed and/or low displacements. When you're
> > > near the end of a page or have multiple high displacements in last half or quarter that's
> > > where most of them will come from and the blocking does halve the penalty for those.
> > > You agreed that in those cases the penalty is halved so if those are most then it's a clear win.
> >
> > The new Skylake behavior helps in the worst case when you have many back-to-back different page
> > addresses, i.e., they are bursty/highly correlated. It doesn't help if the offsets are large and
> > the address distributions are more or less random, so the different page case is more or less
> > randomly distributed, unless the probability is very high (i.e., approaches worst case).
> >
> > If you run some numbers you should see that with the "uniform distribution non-bursty" assumption the new
> > behavior cuts down on the penalty for high offsets, but not anywhere close to half.
>
> Like I said I think that most of the penalties will happen in bursty
> cases, but feel free to show some numbers that say otherwise.
> If you want to argue against a strawman "in the exact opposite
> case that's not true" feel free to do so as well.

I don't know. It's definitely not exactly the opposite though (no correlation).

My initial thought was that most crossing would be through more or less random cases, especially with short chases which are common in programs (think a couple of dereferences in a chain, not "traversing a big linked list"). That is, these short cases eliminate really bursty behavior by design since there is no big chain in the first place.

Bursty seems very likely for certain structures with small nodes, so I think it is common as well.

One important about bursty is that the problem could be very accute: like it actually makes a noticeable regression in benchmark X. So if you don't fix bursty, you might get an actual regression SKL vs BDW for example. The other crossings are just disperse everywhere so it's hard to see them causing a regression if you have some small IPC uplift elsewhere to cancel it out.



> I'm not sure what your problem is. You're simultaneously arguing against highly correlated
> cases being common, which would mean that the SKL behaviour is a loss, while arguing that it
> makes sense because it's "an obvious big win in worst case and highly correlated cases".

I don't find it weird. I'm not trying to score points or evaluate good or bad. The claim that "X is not used useful in case Y but really useful case Z" seems entirely consistent.

I'm not sure I ever said highly correlated cases aren't common, but maybe I said "pathological" (which is different, but may imply uncommon) and called the other case "usual" once. Other than that I think put in all the correct "ifs".

For the record my view is immediately above about correlated cases. I did change my position a bit after I thought about how node-based structures are often allocated, so I definitely think they are more common than originally. I also wouldn't be surprised if your claim that most crossings appear in correlated cases is true, but it would be hard to agree on a representative set of programs.

I realized actually testing this is easier than I thought: `perf mem` will do it: you get a list of access addresses and IPs: you could parse that and the disassembly and actually get a metric here more easily than messing around with pintool or valgrind or whatever.

> Make up your mind. It can't be a big win and at the same time not be a big win.

It's not confusing - it's a big win in the bursty cases. Most likely that's enough to make it a small win overall that's why we see it (or it's a weird bug).

>
> > > > About 2048 vs other values, one would expect they would still try
> > > > all the easy-to-check power-of-two values, e.g., via simulation.
> > >
> > > For 512-1023 it's ~19% even if it were uniform and for 2048-4095
> > > it's ~75% so it was always going to be either 1024 or 2048.
> >
> > Well sure, if the various biasing were stronger or weaker, or the Skylake-like mitigations
> > changed the calculations, you could easily get a smaller number or 4096.
> >
> > If you just assume totally uniform, then 512 seems to maybe come out on top in a Haswell-like chip.
>
> No, the optimum is ~820, 1024 is slightly better than 512.

If nothing else least our math seems to agree!

I said "maybe comes out on top" because the difference between the two was so small at point (something like less than 0.1%) and I assigned some value to the execution cost of the replay as well (extra uop, possibly waking the dependent instruction unnecessarily), beyond the latency penalty and came up with "maybe".
< Previous Post in ThreadNext Post in Thread >
TopicPosted ByDate
4-cycle L1 latency on Intel not as general as thoughTravis Downs2018/09/17 04:32 PM
  4-cycle L1 latency on Intel not as general as thoughanon2018/09/18 02:43 AM
    4-cycle L1 latency on Intel not as general as thoughtTravis Downs2018/09/18 09:39 AM
      4-cycle L1 latency on Intel not as general as thoughtanon2018/09/18 10:53 AM
        4-cycle L1 latency on Intel not as general as thoughtTravis Downs2018/09/18 11:07 AM
          4-cycle L1 latency on Intel not as general as thoughtanon2018/09/18 11:51 AM
            4-cycle L1 latency on Intel not as general as thoughtTravis Downs2018/09/18 01:52 PM
              4-cycle L1 latency on Intel not as general as thoughtanon2018/09/19 02:40 AM
                4-cycle L1 latency on Intel not as general as thoughtTravis Downs2018/09/19 05:20 PM
                  4-cycle L1 latency on Intel not as general as thoughtSeni2018/09/19 10:28 PM
                    4-cycle L1 latency on Intel not as general as thoughtGabriele Svelto2018/09/20 05:16 AM
                      4-cycle L1 latency on Intel not as general as thoughtTravis Downs2018/09/20 02:25 PM
                        4-cycle L1 latency on Intel not as general as thoughtGabriele Svelto2018/09/21 02:46 AM
                  4-cycle L1 latency on Intel not as general as thoughtanon2018/09/20 08:40 AM
                    4-cycle L1 latency on Intel not as general as thoughtTravis Downs2018/09/20 03:01 PM
    You can do two 4-cycle loads per cycleTravis Downs2018/09/18 10:58 AM
      You can do two 4-cycle loads per cycleanon2018/09/18 11:53 AM
        You can do two 4-cycle loads per cycleTravis Downs2018/09/18 12:29 PM
          You can do two 4-cycle loads per cycleanon2018/09/18 01:27 PM
            You can do two 4-cycle loads per cycleWilco2018/09/18 02:37 PM
              You can do two 4-cycle loads per cycleanon2018/09/19 02:45 AM
                You can do two 4-cycle loads per cycleTravis Downs2018/09/19 05:30 PM
                  You can do two 4-cycle loads per cycleanon2018/09/20 01:34 AM
                    You can do two 4-cycle loads per cycleWilco2018/09/20 02:32 AM
                      You can do two 4-cycle loads per cycleanon2018/09/20 04:35 AM
                      You can do two 4-cycle loads per cycleTravis Downs2018/09/20 03:33 PM
                    You can do two 4-cycle loads per cycleTravis Downs2018/09/20 03:10 PM
            You can do two 4-cycle loads per cycleTravis Downs2018/09/18 03:08 PM
              You can do two 4-cycle loads per cycleGabriele Svelto2018/09/19 01:39 AM
                You can do two 4-cycle loads per cycleTravis Downs2018/09/19 05:43 PM
              You can do two 4-cycle loads per cycleanon2018/09/19 02:42 AM
                You can do two 4-cycle loads per cycleTravis Downs2018/09/19 06:09 PM
                  You can do two 4-cycle loads per cycleanon2018/09/20 01:49 AM
                    You can do two 4-cycle loads per cycleTravis Downs2018/09/20 04:38 PM
                    You can do two 4-cycle loads per cycleTravis Downs2018/09/20 07:27 PM
                      You can do two 4-cycle loads per cycleanon2018/09/21 08:08 AM
            Separate RS for ALU vs load/storeTravis Downs2018/12/13 12:55 PM
              Separate RS for ALU vs load/storeanon2018/12/13 02:14 PM
              Separate RS for ALU vs load/storeanon.12018/12/13 09:15 PM
                Separate RS for ALU vs load/storeWilco2018/12/14 04:41 AM
                  Separate RS for ALU vs load/storeanon.12018/12/14 08:08 AM
                    Separate RS for ALU vs load/storeWilco2018/12/14 01:51 PM
              Integer divide also var latencyDavid Kanter2018/12/14 11:45 AM
                Integer divide also var latencyTravis Downs2018/12/14 09:09 PM
              Separate RS for ALU vs load/storeanon22018/12/14 09:57 PM
                Separate RS for ALU vs load/storeTravis Downs2018/12/15 11:00 AM
Reply to this Topic
Name:
Email:
Topic:
Body: No Text
How do you spell avocado?