4-cycle L1 latency on Intel not as general as thought

By: Travis Downs (travis.downs.delete@this.gmail.com), September 20, 2018 3:01 pm
Room: Moderated Discussions
anon (spam.delete.delete@this.this.spam.com) on September 20, 2018 8:40 am wrote:
> First of all it would be interesting to know if the check is actually for [0, 2047] or [-2048, 2047].

It's for [0, 2047]. Negative displacements never use the 4-cycle fast path (on Intel at least, I haven't checked on AMD).
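For reference, here's a minimal sketch of the kind of pointer-chasing test that exposes this (illustrative only, not the exact harness I used; the buffer layout, the suggested DISP values and the timing are all simplified):

    // DISP must be a compile-time constant so the chase compiles to a single
    // "mov rax, [rax + DISP]" with the displacement in the encoding, rather
    // than a separate add or an index register.
    #ifndef DISP
    #define DISP 128                    // try e.g. -DDISP=0, 1024, 2047, 2048, -8
    #endif

    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    int main() {
        const long ITERS = 100000000L;
        // Three pages; put the base at the start of the middle page so negative
        // displacements are legal, and so base + DISP stays on the same page as
        // base for small positive DISP.
        char *buf  = static_cast<char *>(aligned_alloc(4096, 12288));
        char *base = buf + 4096;
        memset(buf, 0, 12288);
        // Self-referential chain: the slot at base + DISP holds base itself, so
        // each load produces the address fed into the next load.
        *reinterpret_cast<char **>(base + DISP) = base;

        char *p = base;
        auto t0 = std::chrono::steady_clock::now();
        for (long i = 0; i < ITERS; i++)
            // volatile keeps the compiler from folding the loop away; it still
            // compiles to a plain dependent load.
            p = *reinterpret_cast<char * volatile *>(p + DISP);
        auto t1 = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        // Convert ns/load to cycles with your core clock: the 4- vs 5-cycle
        // difference shows up directly in this number.
        printf("p=%p  %.3f ns per dependent load\n", static_cast<void *>(p), ns / ITERS);
        free(buf);
        return 0;
    }

Compile at -O1 or higher and pin the frequency; anything that breaks the load-to-load dependency, or that pushes the displacement into an index register, will hide the effect.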

> Secondly I should have specified that it's not just about the displacements themselves but the combination
> of base and displacement. See the unrolled loop case we spoke about in other posts. The displacements
> might very well be random but all the eventual target addresses could still fall within a single page.

Yes, I get that. Both distributions are important and also their correlation. I imagine the correlation is weak for large displacements (i.e., the displacement doesn't imply much about the base address) - but some weird allocator fact could mean I'm wrong here.

My current working assumption is that "random uniform" is probably close to the truth for base pointers, at least for objects up to about a page in size, in large applications that allocate lots of memory of different sizes, since any significant deviation from uniform would seem to imply something weird about the allocator (e.g., that it wastes a lot of memory).

For large objects, I expect many allocations to be page aligned (i.e., (base & 4095) == 0), but I'm not sure if that matters.
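If someone wants to check that for a particular allocator, a throwaway sketch along these lines would do it (1 MiB and 1000 allocations are arbitrary choices, not anything rigorous):

    // Throwaway check: what fraction of "large" malloc() results are page aligned?
    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        const int N = 1000;
        const size_t SZ = 1 << 20;               // 1 MiB, i.e., a "large" object
        void *ptrs[N];
        int aligned = 0;
        for (int i = 0; i < N; i++) {
            ptrs[i] = malloc(SZ);
            // Page aligned means the low 12 bits of the address are zero.
            if ((reinterpret_cast<uintptr_t>(ptrs[i]) & 4095) == 0)
                aligned++;
        }
        printf("page-aligned: %d / %d\n", aligned, N);
        for (int i = 0; i < N; i++)
            free(ptrs[i]);
        return 0;
    }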

For the displacement distribution, I expect it to be mostly smooth and heavily skewed towards small displacements, and perhaps self-similar in different ranges (e.g., maybe the 128 to 256 histogram looks the same as 256 to 512 after normalization).

> See above.

I didn't quite get it, but it seems like we mostly agree, so whatever. If you think something above is still being debated, make it clear and I'm still game.


> The thing is compilers are incredibly good at paying attention to these tiny details so
> I'd expect most cases to happen either when the compiler knows about it, but the tradeoff
> for different adressing modes (possibly even due to encoding or other tiny details) or
> an extra base register isn't worth it, or when something has gone horribly wrong.

I think a lot of this is largely out of the hands of the compilers and in the hands of the developer (who basically chooses the structure layout) and the memory allocators.

Maybe I misunderstood "these tiny details", but I think the sophistication of compilers stops right around understanding the difference between the 4- and 5-cycle encodings at a high level. Nothing more detailed than that (like anything about replays, page crossings, etc.) is modeled or probably even understood. I doubt very much that compilers are able to model "oh, this is likely to be a pointer-chasing dep chain, let's try to get a 4-cycle encoding here", although certainly there are heuristic things like trying to get a simple encoding in general.

Not slagging the compilers at all: they are doing a good job, but they aren't at the bleeding edge of minutiae like this (and largely for good reason, since there is probably much lower-hanging fruit). Even many well-understood uarch-specific micro-optimizations aren't applied.


> The whole argument just becomes weird when your conclusion is that the fast path block lessens
> the negative performance impact of high displacements. You'd have to be working from the
> assumption that high displacements are included despite lowering the performance.

Yes, it's confusing. It would be good to have the exact real distribution for {displacement x address} that shows [1024, 2047] being beneficial, or at least whatever Intel used, before trying to understand the blocking behavior, but we don't have that. So we choose some other distribution like "uniform" and look at what blocking does there, but yes, you might get the wrong answer, at least quantitatively. Maybe you can still make some qualitative claims that hold even under a different distribution.
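To make that concrete, the kind of toy simulation I have in mind looks roughly like this (the uniform page offset and the geometric displacement distribution are both assumptions, not measurements):

    // Toy Monte Carlo: under assumed distributions for the base's page offset and
    // the displacement, how often is a fast-path-eligible load (disp in [0, 2047])
    // actually on a different page than its base, i.e., how often would the
    // same-page assumption behind the 4-cycle path be wrong?
    #include <cstdio>
    #include <random>

    int main() {
        std::mt19937_64 rng(42);
        std::uniform_int_distribution<int> offset(0, 4095);     // assumed: uniform page offset
        std::geometric_distribution<int> disp_dist(1.0 / 64.0); // assumed: skewed toward small disps

        const long N = 10000000;
        long eligible = 0, crossings = 0;
        for (long i = 0; i < N; i++) {
            int off  = offset(rng);
            int disp = disp_dist(rng);
            if (disp <= 2047) {                 // would take the fast path
                eligible++;
                if (off + disp >= 4096)         // ...but base + disp is on the next page
                    crossings++;
            }
        }
        printf("fast-path eligible: %.1f%%, of those on a different page: %.3f%%\n",
               100.0 * eligible / N, 100.0 * crossings / eligible);
        return 0;
    }

Swap in whatever distributions you like; the point is just that the quantitative answer, and possibly even the qualitative one, moves around depending on what you assume.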