You can do two 4-cycle loads per cycle

By: Travis Downs (travis.downs.delete@this.gmail.com), September 20, 2018 4:38 pm
Room: Moderated Discussions
anon (spam.delete.delete@this.this.spam.com) on September 20, 2018 1:49 am wrote:
> Travis Downs (travis.downs.delete@this.gmail.com) on September 19, 2018 6:09 pm wrote:
> > anon (spam.delete.delete@this.this.spam.com) on September 19, 2018 2:42 am wrote:
> >
> > > Or the displacement vs page crossing statistics are vastly different depending on the source.
> >
> > Maybe? Kind of hard to see how though.
> >
> > Displacements are definitely used differently for pointer chasing versus say unrolled linear access
> > of an array (where the address is certain to come from an ALU op) or memcpy. I would expect node
> > based structures to have many small offsets, usually the same offset, while unrolled linear code
> > has a range of offsets which may be big for big unrolls. Hard to imagine it affecting page crossing
> > much. There is a small effect from object alignment: e.g., heap objects which are likely targets of
> > pointer chasing might always be 16B or 32B aligned, which reduces the chance of a page crossing a
> > bit - sometimes to zero for small offsets like 8 - but this effect is very weak.
> >
> > Even if displacement distribution is wildly different, however, it's hard to see
> > how you get different page crossing stats at a given displacement. You'd have
> > to have the memory allocators or stack layout or something in on the game.
> >
>
> I don't know. Maybe the unrolled loop is set up in a way that all displacements between
> -2048 and +2047 point to the same page. Compilers might try to optimize for that.
> Maybe the chances of being within the same page are lower after an ALU op.

Note that only non-negative displacements qualify for 4-cycle loads (the fast path requires a displacement in the 0 to 2047 range).
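
Just to make that concrete, here's a minimal sketch (my own illustration, not the actual benchmark) of a pointer chase whose displacement is a small non-negative constant and is therefore eligible for the fast path; to get a negative displacement you'd need something like "mov rax, [rax-8]" in asm, which never takes the 4-cycle path.

    /* Pointer-chasing sketch: 'next' sits at byte offset 16, so the
     * dependent load is "mov rax, [rax+16]" - a displacement in the
     * 0..2047 range that qualifies for the 4-cycle fast path.
     * (A real test would also hide the pointers from the optimizer.) */
    #include <stdio.h>

    struct node { char pad[16]; struct node *next; };  /* next at offset 16 */

    int main(void) {
        struct node a, b;
        a.next = &b;
        b.next = &a;                      /* 2-node cycle */
        struct node *p = &a;
        for (long i = 0; i < 100000000; i++)
            p = p->next;                  /* dependent load, disp = +16 */
        printf("%p\n", (void *)p);        /* keep the chain live */
        return 0;
    }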

>
> I forgot to mention, but it seems more likely that it could also be a different distribution
> of displacements, or both. So maybe displacements applied to loaded addresses are generally
> lower, while those that are large but within +-2047 tend to be well behaved.

Yes, it could be. It could be a result of the allocator behavior for medium-size objects (around half a page, plus or minus a factor of 2).
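
Just to put rough numbers on the page-crossing intuition (a back-of-the-envelope sketch of my own, assuming 4 KiB pages, 16B-aligned object bases spread uniformly across a page, and ignoring the access width): the fraction of base positions where base+disp lands in the next page grows roughly linearly with the displacement, so how the allocator packs objects within pages could plausibly shift the stats.

    /* Fraction of 16B-aligned base offsets within a 4 KiB page for which
     * base+disp falls in the next page (assumed model, not measured data). */
    #include <stdio.h>

    int main(void) {
        const int PAGE = 4096, ALIGN = 16;
        const int disps[] = { 8, 64, 512, 2047 };
        for (int d = 0; d < 4; d++) {
            int disp = disps[d], cross = 0, slots = PAGE / ALIGN;
            for (int off = 0; off < PAGE; off += ALIGN)
                if (off + disp >= PAGE)        /* crossed into the next page */
                    cross++;
            printf("disp=%4d: %5.1f%% of bases cross a page\n",
                   disp, 100.0 * cross / slots);
        }
        return 0;
    }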


> To get back to the unrolling case: If the loop isn't completely unrolled there should be an
> ALU op on the base address every iteration, which would prevent the fast path, wouldn't it?

Yes, I was using the unrolled case as an example of a case where all loads essentially come from ALU ops (whether unrolled or not, really - I just invoked unrolling since that's one thing that produces a variety of offsets).
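
To show what I mean by "all loads essentially come from ALU ops", here's a sketch (my illustration, not code from the tests) of an unrolled linear walk: the compiler typically turns this into loads at displacements 0/8/16/24 from a register that it advances with an ALU add (or into indexed addressing), so every load address comes from an ALU op rather than from a prior load.

    /* Unrolled linear access: the loads' base/index registers are produced
     * by ALU ops (pointer/index increments), never by a prior load. */
    #include <stddef.h>

    long sum_unrolled(const long *a, size_t n) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i;
        for (i = 0; i + 4 <= n; i += 4) {  /* 4x unroll: offsets 0, 8, 16, 24 */
            s0 += a[i + 0];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        long s = s0 + s1 + s2 + s3;
        for (; i < n; i++)                 /* remainder */
            s += a[i];
        return s;
    }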

>
>
> > > Have you tested disp8 vs disp32 for ALU sources?
> >
> > The test I used has no displacement, so is about the best case for no page
> > crossing (impossible unless you are accessing something misaligned).
> >
> >
>
> So the replay block happens with no displacement?

I'm confused here. By "replay block" do you mean the 10,5,10,5 effect on Skylake where a load which gets its address from a failed-fast-path load (replayed) takes the 5-cycle path instead of 4? That's how I thought we were using that term.

If so, I wasn't testing anything related to that here: I was testing the case where the load's base reg comes from an ALU op. That case never takes the 4-cycle path and so is never eligible for a "replay block" in the first place.

To answer your question, though: although this test wasn't related to it, AFAIK the replay block occurs regardless of the displacement. If the second load has a displacement of zero but the prior load replayed, the second load takes 5 cycles, not 4.
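
For what it's worth, this is roughly the shape of what I was measuring (a sketch assuming GCC/Clang for the inline-asm trick, not the actual test code): the base register of every pointer-chasing load is produced by an ALU add, so the loads always take the 5-cycle path and the fast path (and hence any replay) never enters the picture.

    /* Pointer chase where each load's base register comes from an ALU op:
     * an add of a value the compiler can't prove is zero. */
    #include <stdint.h>
    #include <stdio.h>

    struct node { struct node *next; };

    int main(void) {
        struct node a, b;
        a.next = &b;
        b.next = &a;                               /* 2-node cycle */
        struct node *p = &a;
        uintptr_t z = 0;
        __asm__("" : "+r"(z));                     /* hide the zero from the optimizer */
        for (long i = 0; i < 100000000; i++) {
            p = (struct node *)((uintptr_t)p + z); /* ALU add in the address chain */
            p = p->next;                           /* base reg comes from the add */
        }
        printf("%p\n", (void *)p);
        return 0;
    }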

> That would mean page crossings have to be highly correlated for it to be a win.
>
> My idea was that since page crossings are vastly more likely with large displacements, they'd only assume
> that the next will fail as well for disp32 and just take the chance for no displacement/disp8.

As far as I know it doesn't do that, but I can double-check.
