Separate RS for ALU vs load/store

By: anon.1 (abc.delete@this.def.com), December 13, 2018 9:15 pm
Room: Moderated Discussions
Travis Downs (travis.downs.delete@this.gmail.com) on December 13, 2018 11:55 am wrote:
> anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 12:27 pm wrote:
>
> > Then how do you explain the restriction? What prevents the use of the
> > fast path with registers that weren't the result of an earlier load?
>
>
> Something that might be related to the 1-cycle longer latency for pointer
> chasing loops that have an ALU op came up in the Sunny Cove slides:
>
> SKL vs ICL
>
> In the "Skylake reminder" it clearly shows that there is a separate RS for the 3
> AGUs. That's the first I've heard of this possibility (Intel usually describes their
> scheduler as a single unified scheduler, so it comes as a bit of a surprise).
>
> Perhaps a separate scheduler is used because the implementation for loads is more complicated due to replay
> considerations (although really it complicates all the dependent operations) or because these are the
> only (?) uops with variable latency. In any case, it could imply that the extra cycle occurs because of
> cross-RS delays: such as the AGU is the only one that knows whether an op is 4 or 5 latency, so operations
> in the ALU RS just wake up with conservative value of 5 cycles, which almost always works.
>
> Or, the slide could just be wrong. There are other errors there, like where it shows the scalar
> mul unit on p5 in SKL, but it is actually on p1 there (and so Anandtech subsequently mentioned
> that the mul unit is moving around based on this incorrect slide). Still it looks deliberate
> to me how they have broken up the RS boxes, not just a quirk of the diagram.
>
> Or, the slide could be right but the separate RS has no practical impact on the load->ALU latency.

I had heard about the split AGU scheduler from someone who did microbenchmarks, much like you :) According to this person, Intel had a 64-entry "unified scheduler"[1] for non-AGEN uops and a 32-entry AGEN scheduler, for a total of 96 entries. The previous scheduler (Haswell) was 64 entries. I mentioned it once on this forum and someone said that wasn't what the optimization guide says, so I didn't push it further (and I wasn't bothered enough to test it for myself). Good to see them acknowledge it.

Separating AGEN could be for the following reasons: (1) AGEN ops don't need to do a tag broadcast, so you can take them out of the main single-cycle loop and maybe physically move them closer to the load/store unit (provided that you don't increase the ALU->AGU tag and data latency in doing so). (2) It reduces the number of read and write ports required on your "unified scheduler". Zen is logically pretty much like this, if you recall: 4x ALU, 2x AGU queues. The difference is that Intel has merged the AGU queues and may have merged the ALU queues.

A separate store data scheduler is also likely for similar reasons: you don't need to do a tag broadcast for store data, so why have it compete with tag-broadcasting ops? And you can move the scheduler physically closer to the store queue. 2 stores is interesting though... is it 2 stores generally (including AVX-512?) or with restrictions?
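
For what it's worth, the cross-RS wake-up hypothesis from the quoted post can be written down as a toy model. This is only a sketch: the 4/5-cycle load latencies come from the discussion above, while the 1-cycle ALU latency, the wake-up rule, and the function name are my assumptions for illustration, not documented behavior.

```python
# Toy model only: the wake-up rule below is the hypothesis from the quoted
# post, not measured or documented behavior.
FAST_LOAD = 4          # load -> load latency (fast path, consumer sits in the AGU RS)
CONSERVATIVE_LOAD = 5  # assumed latency the ALU RS uses to wake consumers of a load
ALU_LATENCY = 1        # assumed latency of a simple integer op


def cycles_per_iteration(chain):
    """Latency of one trip around a loop-carried dependency chain.

    chain is a list of "load" / "alu" ops; each op feeds the next one,
    and the last op feeds the first op of the next iteration.
    """
    total = 0
    n = len(chain)
    for i, op in enumerate(chain):
        consumer = chain[(i + 1) % n]
        if op == "alu":
            total += ALU_LATENCY
        elif consumer == "load":
            total += FAST_LOAD          # producer and consumer are both AGU-side
        else:
            total += CONSERVATIVE_LOAD  # load result crosses into the ALU RS
    return total


print(cycles_per_iteration(["load"]))         # pure pointer chasing
print(cycles_per_iteration(["load", "alu"]))  # pointer chasing with one ALU op
```

Under these assumptions the pure pointer chase costs 4 cycles per iteration and the load+ALU loop costs 6, i.e. one more than the naive 4+1, which is the extra cycle being discussed.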


[1] unified meaning that the person could not figure out how it was partitioned, if at all. I suspect it is partitioned somehow, but it's not easy to make out from directed tests. I am not convinced that the ALU scheduler is unified either, but that seems hard to test, so it's at least symmetrical.