You can do two 4-cycle loads per cycle

By: anon (spam.delete.delete@this.this.spam.com), September 19, 2018 2:42 am
Room: Moderated Discussions
Travis Downs (travis.downs.delete@this.gmail.com) on September 18, 2018 3:08 pm wrote:
> anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 1:27 pm wrote:
>
> > Then how do you explain the restriction? What prevents the use of the
> > fast path with registers that weren't the result of an earlier load?
>
> I don't know.
>
> One theory would be that it is a true restriction, i.e., the hardware can't easily support the fast
> path, e.g,. because there is a dedicated path for loads to feed directly back into the load EU, or
> the timing otherwise doesn't work out if an ALU is involved. That's what I thought originally.
>
> The other theory would be that it's totally possible to implement the fast path in general, at least for
> simple addressing, but that it necessarily doesn't pay off outside of pure pointer-chasing because the payoff
> is less direct, the latency penalties are still there and replays have a higher cost in that if you're trying
> to get 2 loads per cycle a replay definitely steals the EU for the second try at the load.
>

Or the displacement vs page crossing statistics are vastly different depending on the source.
Have you tested disp8 vs disp32 for ALU sources?

> Also, as Wilco points out, once you start mixing latencies things become trickier, like both simple and
> complex addressing in the same loop: things might not work out well due to conflicts in the pipeline:
> including inside the load EUs and then for writeback/bypass etc. So if the opposite of "pointer chasing"
> is "load throughput" then maybe you are better of sticking with 5-cycle loads all the time.
>
> I was partial to the first theory, but I think I'm being convinced the second is more likely.
>
> It's clear that's it's possible to do 4 cycles unconditionally, at least with simple addressing, with
> an L1 cache in more or less the Intel style, since Ryzen does it and the cache is fairly similar (at least
> from the outside). The Intel L1D is a bit more capable in terms of misaligned loads and 256-bit loads though,
> so maybe that's where the trade-off lies, or maybe mixing different latencies really does suck.

< Previous Post in ThreadNext Post in Thread >
TopicPosted ByDate
4-cycle L1 latency on Intel not as general as thoughTravis Downs2018/09/17 04:32 PM
  4-cycle L1 latency on Intel not as general as thoughanon2018/09/18 02:43 AM
    4-cycle L1 latency on Intel not as general as thoughtTravis Downs2018/09/18 09:39 AM
      4-cycle L1 latency on Intel not as general as thoughtanon2018/09/18 10:53 AM
        4-cycle L1 latency on Intel not as general as thoughtTravis Downs2018/09/18 11:07 AM
          4-cycle L1 latency on Intel not as general as thoughtanon2018/09/18 11:51 AM
            4-cycle L1 latency on Intel not as general as thoughtTravis Downs2018/09/18 01:52 PM
              4-cycle L1 latency on Intel not as general as thoughtanon2018/09/19 02:40 AM
                4-cycle L1 latency on Intel not as general as thoughtTravis Downs2018/09/19 05:20 PM
                  4-cycle L1 latency on Intel not as general as thoughtSeni2018/09/19 10:28 PM
                    4-cycle L1 latency on Intel not as general as thoughtGabriele Svelto2018/09/20 05:16 AM
                      4-cycle L1 latency on Intel not as general as thoughtTravis Downs2018/09/20 02:25 PM
                        4-cycle L1 latency on Intel not as general as thoughtGabriele Svelto2018/09/21 02:46 AM
                  4-cycle L1 latency on Intel not as general as thoughtanon2018/09/20 08:40 AM
                    4-cycle L1 latency on Intel not as general as thoughtTravis Downs2018/09/20 03:01 PM
    You can do two 4-cycle loads per cycleTravis Downs2018/09/18 10:58 AM
      You can do two 4-cycle loads per cycleanon2018/09/18 11:53 AM
        You can do two 4-cycle loads per cycleTravis Downs2018/09/18 12:29 PM
          You can do two 4-cycle loads per cycleanon2018/09/18 01:27 PM
            You can do two 4-cycle loads per cycleWilco2018/09/18 02:37 PM
              You can do two 4-cycle loads per cycleanon2018/09/19 02:45 AM
                You can do two 4-cycle loads per cycleTravis Downs2018/09/19 05:30 PM
                  You can do two 4-cycle loads per cycleanon2018/09/20 01:34 AM
                    You can do two 4-cycle loads per cycleWilco2018/09/20 02:32 AM
                      You can do two 4-cycle loads per cycleanon2018/09/20 04:35 AM
                      You can do two 4-cycle loads per cycleTravis Downs2018/09/20 03:33 PM
                    You can do two 4-cycle loads per cycleTravis Downs2018/09/20 03:10 PM
            You can do two 4-cycle loads per cycleTravis Downs2018/09/18 03:08 PM
              You can do two 4-cycle loads per cycleGabriele Svelto2018/09/19 01:39 AM
                You can do two 4-cycle loads per cycleTravis Downs2018/09/19 05:43 PM
              You can do two 4-cycle loads per cycleanon2018/09/19 02:42 AM
                You can do two 4-cycle loads per cycleTravis Downs2018/09/19 06:09 PM
                  You can do two 4-cycle loads per cycleanon2018/09/20 01:49 AM
                    You can do two 4-cycle loads per cycleTravis Downs2018/09/20 04:38 PM
                    You can do two 4-cycle loads per cycleTravis Downs2018/09/20 07:27 PM
                      You can do two 4-cycle loads per cycleanon2018/09/21 08:08 AM
            Separate RS for ALU vs load/storeTravis Downs2018/12/13 12:55 PM
              Separate RS for ALU vs load/storeanon2018/12/13 02:14 PM
              Separate RS for ALU vs load/storeanon.12018/12/13 09:15 PM
                Separate RS for ALU vs load/storeWilco2018/12/14 04:41 AM
                  Separate RS for ALU vs load/storeanon.12018/12/14 08:08 AM
                    Separate RS for ALU vs load/storeWilco2018/12/14 01:51 PM
              Integer divide also var latencyDavid Kanter2018/12/14 11:45 AM
                Integer divide also var latencyTravis Downs2018/12/14 09:09 PM
              Separate RS for ALU vs load/storeanon22018/12/14 09:57 PM
                Separate RS for ALU vs load/storeTravis Downs2018/12/15 11:00 AM
Reply to this Topic
Name:
Email:
Topic:
Body: No Text
How do you spell tangerine? 🍊