You can do two 4-cycle loads per cycle

By: anon (spam.delete.delete@this.this.spam.com), September 20, 2018 1:49 am
Room: Moderated Discussions
Travis Downs (travis.downs.delete@this.gmail.com) on September 19, 2018 6:09 pm wrote:
> anon (spam.delete.delete@this.this.spam.com) on September 19, 2018 2:42 am wrote:
>
> > Or the displacement vs page crossing statistics are vastly different depending on the source.
>
> Maybe? Kind of hard to see how though.
>
> Displacements are definitely used differently for pointer chasing versus say unrolled linear access
> of an array (where the address is certain to come from an ALU op) or memcpy. I would expect node
> based structures to have many small offsets, usually the same offset, while unrolled linear code
> has a range of offsets which may be big for big unrolls. Hard to imagine it affecting page crossing
> much. There is a small effect due to object alignment: e.g., heap objects which are likely targets of
> pointer chasing might be 16B or 32B aligned always, which reduces the chance of page crossing a
> bit - sometimes to zero for small offsets like 8 - but this effect is very weak.
>
> Even if displacement distribution is wildly different, however, it's hard to see
> how you get different page crossing stats at a given displacement. You'd have
> to have the memory allocators or stack layout or something in on the game.
>

I don't know. Maybe the unrolled loop is set up so that all displacements between -2048 and +2047 point to the same page; compilers might try to optimize for that. Or maybe the chances of staying within the same page are lower after an ALU op.
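
To make the same-page condition concrete, here is a minimal Python sketch. It assumes 4 KiB pages and only looks at the first byte of each access; `same_page` and `safe_offsets` are names made up for this illustration, not anything taken from the hardware or from a compiler.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def same_page(base, disp, page_size=PAGE_SIZE):
    """Return True if base and base + disp lie in the same page."""
    return base // page_size == (base + disp) // page_size


def safe_offsets(disp_min=-2048, disp_max=2047, page_size=PAGE_SIZE):
    """In-page offsets of the base for which every displacement in
    [disp_min, disp_max] stays in the base's page.

    Checking the two extremes is enough, because every displacement
    in between lands between them.
    """
    safe = []
    for off in range(page_size):
        # Put the base in page 1 so base + disp never goes negative.
        base = page_size + off
        if same_page(base, disp_min) and same_page(base, disp_max):
            safe.append(off)
    return safe


print(safe_offsets())            # full -2048..+2047 range
print(len(safe_offsets(0, 56)))  # e.g. eight 8-byte loads at offsets 0..56
```

Under these assumptions the full -2048..+2047 range is only safe when the base sits exactly at offset 2048 within its page, while a small non-negative range such as 0..56 is safe for 4040 of the 4096 possible base offsets.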

I forgot to mention it, but it seems more likely that the cause is a different distribution of displacements, or both effects together. So maybe displacements applied to loaded addresses are generally smaller, while those that are large but within +/-2047 tend to be well behaved.

To get back to the unrolling case: if the loop isn't completely unrolled, there should be an ALU op on the base address every iteration, which would prevent the fast path, wouldn't it?


> > Have you tested disp8 vs disp32 for ALU sources?
>
> The test I used has no displacement, so is about the best case for no page
> crossing (impossible unless you are accessing something misaligned).
>
>

So the replay block happens with no displacement?
That would mean page crossings have to be highly correlated for it to be a win.

My idea was that, since page crossings are vastly more likely with large displacements, they'd only assume that the next access will fail as well for disp32, and just take the chance for no displacement/disp8.
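
A rough way to put numbers on "more likely with large displacements" is the sketch below. It assumes 4 KiB pages and base addresses spread uniformly over the page at a given alignment, which real workloads may well not match; `crossing_fraction` is a hypothetical helper for this post only.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def crossing_fraction(disp, align=1, page_size=PAGE_SIZE):
    """Fraction of align-aligned base offsets within a page for which
    base + disp falls outside the page that contains base.

    Assumes every aligned offset is equally likely.
    """
    offsets = range(0, page_size, align)
    crossing = sum(1 for off in offsets if not 0 <= off + disp < page_size)
    return crossing / len(offsets)


for disp in (8, 127, 2047):
    print(disp, crossing_fraction(disp), crossing_fraction(disp, align=16))
```

With a uniform base the fraction is simply |disp| / 4096, so roughly 0.2% for a displacement of 8 and close to 50% at 2047. With 16-byte aligned bases a displacement of 8 never crosses, which is the alignment effect mentioned in the quoted post.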