Separate RS for ALU vs load/store

By: Wilco (Wilco.dijkstra.delete@this.ntlworld.com), December 14, 2018 4:41 am
Room: Moderated Discussions
anon.1 (abc.delete@this.def.com) on December 13, 2018 8:15 pm wrote:
> Travis Downs (travis.downs.delete@this.gmail.com) on December 13, 2018 11:55 am wrote:
> > anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 12:27 pm wrote:
> >
> > > Then how do you explain the restriction? What prevents the use of the
> > > fast path with registers that weren't the result of an earlier load?
> >
> >
> > Something that might be related to the 1-cycle longer latency for pointer
> > chasing loops that have an ALU op came up in the Sunny Cove slides:
> >
> > SKL vs ICL
> >
> > In the "Skylake reminder" it clearly shows that there is a separate RS for the 3
> > AGUs. That's the first I've heard of this possibility (Intel usually describes their
> > scheduler as a single unified scheduler, so it comes as a bit of a surprise).
> >
> > Perhaps a separate scheduler is used because the implementation for loads is more complicated due to replay
> > considerations (although really it complicates all the dependent operations) or because these are the
> > only (?) uops with variable latency. In any case, it could imply that the extra cycle occurs because of
> > cross-RS delays: such as the AGU is the only one that knows whether an op is 4 or 5 latency, so operations
> > in the ALU RS just wake up with conservative value of 5 cycles, which almost always works.
> >
> > Or, the slide could just be wrong. There are other errors there, like where it shows the scalar
> > mul unit on p5 in SKL, but it is actually on p1 there (and so Anandtech subsequently mentioned
> > that the mul unit is moving around based on this incorrect slide). Still it looks deliberate
> > to me how they have broken up the RS boxes, not just a quirk of the diagram.
> >
> > Or, the slide could be right but the separate RS has no practical impact on the load->ALU latency.
>
> I had heard about the split AGU from someone who did microbenchmarks, much like you :) According to
> this person, Intel had a 64-entry "unified scheduler[1]" for non-agen, and a 32-entry agen scheduler,
> resulting in a total of 96. The previous scheduler (Haswell) was 64 entries. I had mentioned it once
> on this forum and someone said that wasn't what the optimization guide says, so I didn't push it further
> (and I wasn't too bothered to test it for myself). Good to see them acknowledge it.
>
> Separating agen could be for the following reasons: (1) Agen ops don't need to do a tag broadcast, so you can
> take them out of the main single-cycle loop and maybe physically move them closer to the load/store unit (provided
> that you don't increase the ALU->AGU tag and data latency in doing so). (2) Reduces the number of read and write
> ports required on your "unified scheduler". Zen is logically pretty much like this, if you recall: 4x ALU, 2x
> AGU queues. Only that intel has merged the AGU queues and may have merged the ALU queues.
>
> A separate store data scheduler is also likely for similar reasons: you don't need to
> do a tag broadcast for store data so why have it compete with tag-broadcasting ops? And
> you can move the scheduler physically closer to the store queue. 2 stores is interesting
> though... is it 2 stores generally (including AVX-512?) or with restrictions?
>
>
> [1] unified meaning that the person could not figure out how it was partitioned, if at all. I suspect
> it is partitioned somehow, but it's not easy to make out from directed tests. I am not convinced that
> the ALU scheduler is unified either, but that seems hard to test, so it's at least symmetrical.

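It should be easy to test by having a long sequence of multiplies followed by independent ALU ops that use a different port (e.g. shifts). If the multiplies all fit in the RS, the ALU ops behind them can enter the scheduler and execute in parallel with the multiplies. If they don't fit, the ALU ops are stuck behind them at allocation and you get more sequential execution.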
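To show why the run time should jump once the multiplies no longer fit, here is a rough toy model in Python. It is only a sketch under assumptions of my own, not a description of any real core: a single scheduler with in-order allocation, a chain of dependent multiplies (so they sit in the RS waiting), one multiply port and one shift port, and made-up defaults for multiply latency and allocation width. The name simulate and all of its parameters are hypothetical.

```python
def simulate(n_mul, n_shift, rs_size, mul_latency=3, alloc_width=4):
    """Toy model: cycles to finish n_mul chained multiplies followed by
    n_shift independent shifts, with one scheduler of rs_size entries.
    All parameters are illustrative assumptions, not measured values."""
    program = ["mul"] * n_mul + ["shift"] * n_shift
    next_alloc = 0      # next op to allocate, in program order
    rs = []             # ops waiting in the scheduler, oldest first
    mul_ready_at = 0    # cycle at which the previous multiply's result is ready
    last_done = 0       # completion cycle of the latest-finishing op
    cycle = 0
    while next_alloc < len(program) or rs:
        # Issue: at most one op per port per cycle, oldest first.
        issued_mul = issued_shift = False
        remaining = []
        for op in rs:
            if op == "mul" and not issued_mul and cycle >= mul_ready_at:
                issued_mul = True
                mul_ready_at = cycle + mul_latency
                last_done = max(last_done, mul_ready_at)
            elif op == "shift" and not issued_shift:
                issued_shift = True
                last_done = max(last_done, cycle + 1)
            else:
                remaining.append(op)   # still waiting, keeps its RS entry
        rs = remaining
        # Allocate: in order, limited by width and by free RS entries.
        for _ in range(alloc_width):
            if next_alloc < len(program) and len(rs) < rs_size:
                rs.append(program[next_alloc])
                next_alloc += 1
        cycle += 1
    return last_done


if __name__ == "__main__":
    # Sweep the number of multiplies; look for where the time starts to climb faster.
    for n_mul in (16, 32, 48, 64, 80, 96):
        print(n_mul, simulate(n_mul, n_shift=150, rs_size=64))
```

In this model the shifts overlap with the multiply chain as long as the multiplies leave room in the scheduler. Once the chain is longer than rs_size, the shifts cannot be allocated until enough multiplies have issued, so the total time grows. On real hardware you would do the same sweep with actual instructions and look for the knee; where it lands (and whether it moves when the filler ops use a different port) is what would hint at how the scheduler is partitioned.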

Wilco