By: Wilco (Wilco.dijkstra.delete@this.ntlworld.com), December 14, 2018 4:41 am
Room: Moderated Discussions
anon.1 (abc.delete@this.def.com) on December 13, 2018 8:15 pm wrote:
> Travis Downs (travis.downs.delete@this.gmail.com) on December 13, 2018 11:55 am wrote:
> > anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 12:27 pm wrote:
> >
> > > Then how do you explain the restriction? What prevents the use of the
> > > fast path with registers that weren't the result of an earlier load?
> >
> >
> > Something that might be related to the 1-cycle longer latency for pointer
> > chasing loops that have an ALU op came up in the Sunny Cove slides:
> >
> > [slide image omitted]
> >
> > In the "Skylake reminder" it clearly shows that there is a separate RS for the 3
> > AGUs. That's the first I've heard of this possibility (Intel usually describes their
> > scheduler as a single unified scheduler, so it comes as a bit of a surprise).
> >
> > Perhaps a separate scheduler is used because the implementation for loads is more complicated due to replay
> > considerations (although really it complicates all the dependent operations) or because these are the
> > only (?) uops with variable latency. In any case, it could imply that the extra cycle occurs because of
> > cross-RS delays: for example, the AGU scheduler is the only one that knows whether a load has 4- or 5-cycle
> > latency, so operations in the ALU RS just wake up with the conservative value of 5 cycles, which almost always works.
> >
> > Or, the slide could just be wrong. There are other errors there, like where it shows the scalar
> > mul unit on p5 in SKL, but it is actually on p1 there (and so Anandtech subsequently mentioned
> > that the mul unit is moving around based on this incorrect slide). Still, it looks deliberate
> > to me how they have broken up the RS boxes, not just a quirk of the diagram.
> >
> > Or, the slide could be right but the separate RS has no practical impact on the load->ALU latency.
>
> I had heard about the split AGU from someone who did microbenchmarks, much like you :) According to
> this person, Intel had a 64-entry "unified scheduler"[1] for non-agen ops, and a 32-entry agen scheduler,
> resulting in a total of 96. The previous scheduler (Haswell) was 64 entries. I had mentioned it once
> on this forum and someone said that wasn't what the optimization guide says, so I didn't push it further
> (and I wasn't too bothered to test it for myself). Good to see them acknowledge it.
>
> Separating agen could be for the following reasons: (1) Agen ops don't need to do a tag broadcast, so you can
> take them out of the main single-cycle loop and maybe physically move them closer to the load/store unit (provided
> that you don't increase the ALU->AGU tag and data latency in doing so). (2) Reduces the number of read and write
> ports required on your "unified scheduler". Zen is logically pretty much like this, if you recall: 4x ALU, 2x
> AGU queues. The difference is that Intel has merged the AGU queues and may have merged the ALU queues.
>
> A separate store data scheduler is also likely for similar reasons: you don't need to
> do a tag broadcast for store data, so why have it compete with tag-broadcasting ops? And
> you can move the scheduler physically closer to the store queue. 2 stores is interesting
> though... is it 2 stores generally (including AVX-512?) or with restrictions?
>
>
> [1] "Unified" meaning that the person could not figure out how it was partitioned, if at all. I suspect
> it is partitioned somehow, but it's not easy to make out from directed tests. I am not convinced that
> the ALU scheduler is unified either, but that seems hard to test, so it's at least symmetrical.

It should be easy to test by having a long sequence of multiplies followed by independent ALU ops that use a different port (e.g. shifts). If they fit in the RS, they execute in parallel; if not, you get more sequential execution.
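
A rough sketch of what I mean (my own illustration, not measured on the parts discussed above; x86-64 with GCC or Clang inline asm assumed, and the chain lengths are arbitrary knobs you would sweep):

/* Sketch: does a long dependent imul chain starve independent shifts of
 * scheduler entries? Sweep the number of imuls (fixed at 16 here) and the
 * number of shifts (8 here) and watch cycles/iteration. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>          /* __rdtsc() */

#define REPS 1000000

static inline void kernel(uint64_t *a, uint64_t *b)
{
    uint64_t x = *a, y = *b;
    asm volatile(
        /* dependent multiplies: each waits on the previous one
         * (imul goes to a single port on recent Intel cores) */
        "imul %0, %0\n\t" "imul %0, %0\n\t" "imul %0, %0\n\t" "imul %0, %0\n\t"
        "imul %0, %0\n\t" "imul %0, %0\n\t" "imul %0, %0\n\t" "imul %0, %0\n\t"
        "imul %0, %0\n\t" "imul %0, %0\n\t" "imul %0, %0\n\t" "imul %0, %0\n\t"
        "imul %0, %0\n\t" "imul %0, %0\n\t" "imul %0, %0\n\t" "imul %0, %0\n\t"
        /* independent shifts: different port, no dependence on the imuls */
        "shl $1, %1\n\t" "shr $1, %1\n\t" "shl $1, %1\n\t" "shr $1, %1\n\t"
        "shl $1, %1\n\t" "shr $1, %1\n\t" "shl $1, %1\n\t" "shr $1, %1\n\t"
        : "+r"(x), "+r"(y) : : "cc");
    *a = x;
    *b = y;
}

int main(void)
{
    uint64_t a = 3, b = 5;      /* odd seed keeps the imul chain non-zero */
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < REPS; i++)
        kernel(&a, &b);
    uint64_t t1 = __rdtsc();
    printf("cycles/iteration ~ %.1f (a=%llu, b=%llu)\n",
           (double)(t1 - t0) / REPS,
           (unsigned long long)a, (unsigned long long)b);
    return 0;
}

If growing the dependent imul chain past the supposed RS size makes the independent shifts stop overlapping (cycles/iteration rises faster than the multiply latency alone), the two ports share scheduler entries; if the shifts keep hiding behind the chain, the shift port has a queue of its own.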
Wilco