Separate RS for ALU vs load/store

By: anon (spam.delete.delete.delete@this.this.this.spam.com), December 13, 2018 2:14 pm
Room: Moderated Discussions
Travis Downs (travis.downs.delete@this.gmail.com) on December 13, 2018 11:55 am wrote:
> anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 12:27 pm wrote:
>
> > Then how do you explain the restriction? What prevents the use of the
> > fast path with registers that weren't the result of an earlier load?
>
>
> Something that might be related to the 1-cycle longer latency for pointer
> chasing loops that have an ALU op came up in the Sunny Cove slides:
>
> SKL vs ICL
>
> In the "Skylake reminder" it clearly shows that there is a separate RS for the 3
> AGUs. That's the first I've heard of this possibility (Intel usually describes their
> scheduler as a single unified scheduler, so it comes as a bit of a surprise).
>

Indeed.

> Perhaps a separate scheduler is used because the implementation for loads is more complicated due to replay
> considerations (although really it complicates all the dependent operations) or because these are the
> only (?) uops with variable latency. In any case, it could imply that the extra cycle occurs because of
> cross-RS delays: for example, the AGU RS is the only one that knows whether a load has 4- or 5-cycle latency,
> so operations in the ALU RS just wake up with a conservative value of 5 cycles, which almost always works.
>

Or the delay exists regardless, and it just didn't make sense to build a truly unified scheduler when it wouldn't improve anything. The benefit of a common RS for all AGUs is obvious for pointer chasing and load balancing, but there's no need to share it with the ALUs.

Maybe simple load + bypass to AGU is doable within 4 cycles, and complex load + bypass to ALU is doable within 5 cycles, but simple load + bypass to ALU doesn't quite fit within 4 cycles because of the physical distance.
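
To make the arithmetic behind that guess concrete, here is a toy model of the loop-carried dependency chain. It is only a sketch: the 4- and 5-cycle load figures come from this thread, the 1-cycle ALU op is my assumption for a simple add/and/xor, and the names (`LOAD_TO_AGU`, `LOAD_TO_ALU`, `ALU_LATENCY`, `chain_cycles`) are made up for illustration. It measures nothing on real hardware.

```python
# Hypothetical latencies in cycles. The 4 and 5 are the figures discussed in
# this thread; the 1-cycle ALU op is an assumption for a simple add/and/xor.
LOAD_TO_AGU = 4   # simple load whose result feeds another load's address
LOAD_TO_ALU = 5   # load whose result feeds an ALU op
ALU_LATENCY = 1   # simple ALU op, whatever consumes its result


def chain_cycles(ops):
    """Cycles per iteration for a loop-carried chain of "load"/"alu" ops.

    `ops` lists the ops of one iteration in dependency order; the last op
    feeds the first op of the next iteration.
    """
    total = 0
    for i, op in enumerate(ops):
        consumer = ops[(i + 1) % len(ops)]  # wraps around to the next iteration
        if op == "load":
            total += LOAD_TO_AGU if consumer == "load" else LOAD_TO_ALU
        elif op == "alu":
            total += ALU_LATENCY
        else:
            raise ValueError(f"unknown op: {op!r}")
    return total


if __name__ == "__main__":
    print("pure pointer chase:", chain_cycles(["load"]))
    print("chase with one ALU op:", chain_cycles(["load", "alu"]))
```

Under these assumptions a pure chase costs 4 cycles per iteration, while a chase with one ALU op in the chain costs 5 + 1 = 6 rather than the 4 + 1 = 5 you might naively expect, which is the 1-cycle penalty being discussed.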

> Or, the slide could just be wrong. There are other errors there, like where it shows the scalar
> mul unit on p5 in SKL, but it is actually on p1 there (and so Anandtech subsequently mentioned
> that the mul unit is moving around based on this incorrect slide). Still it looks deliberate
> to me how they have broken up the RS boxes, not just a quirk of the diagram.
>
> Or, the slide could be right but the separate RS has no practical impact on the load->ALU latency.
