By: anon (spam.delete.delete.delete@this.this.this.spam.com), December 13, 2018 2:14 pm
Room: Moderated Discussions
Travis Downs (travis.downs.delete@this.gmail.com) on December 13, 2018 11:55 am wrote:
> anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 12:27 pm wrote:
>
> > Then how do you explain the restriction? What prevents the use of the
> > fast path with registers that weren't the result of an earlier load?
>
>
> Something that might be related to the 1-cycle longer latency for pointer
> chasing loops that have an ALU op came up in the Sunny Cove slides:
>
>
>
> In the "Skylake reminder" it clearly shows that there is a separate RS for the 3
> AGUs. That's the first I've heard of this possibility (Intel usually describes their
> scheduler as a single unified scheduler, so it comes as a bit of a surprise).
>
Indeed.
> Perhaps a separate scheduler is used because the implementation for loads is more complicated due to replay
> considerations (although really it complicates all the dependent operations) or because these are the
> only (?) uops with variable latency. In any case, it could imply that the extra cycle occurs because of
> cross-RS delays: for example, the AGU side is the only one that knows whether a load has 4- or 5-cycle latency, so operations
> in the ALU RS just wake up with a conservative value of 5 cycles, which almost always works.
>
Or the delay exists regardless and it just didn't make sense to have a truly unified scheduler when it didn't improve anything. The benefit of a common RS for all AGUs is obvious for pointer chasing and load balancing, but there's no need for sharing with the ALUs.
Maybe simple load + bypass to AGU is doable within 4 cycles, and complex load + bypass to ALU is doable within 5 cycles, but simple load + bypass to ALU doesn't quite fit within 4 cycles due to the physical distance.
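To make the arithmetic concrete, here is a toy model of the latencies being discussed. It is only a sketch: the 4 and 5 cycle figures are the ones from this thread, the 1-cycle ALU op is my assumption, and the names (`FAST_LOAD`, `SLOW_LOAD`, `cycles_per_iteration`) are made up for illustration. It says nothing about why the extra cycle exists.

```python
# Toy model of the load-to-use latencies discussed in this thread. Not measured data.
FAST_LOAD = 4  # simple load whose address came straight from another load (AGU bypass)
SLOW_LOAD = 5  # load whose result is bypassed to an ALU instead
ALU_OP = 1     # assumption: a single-cycle op such as add or and


def cycles_per_iteration(alu_ops: int) -> int:
    """Latency of one link in a dependent load chain with alu_ops ALU ops per link."""
    if alu_ops == 0:
        return FAST_LOAD                  # pure pointer chasing: load -> load
    return SLOW_LOAD + alu_ops * ALU_OP   # load -> ALU op(s) -> next load


print(cycles_per_iteration(0))  # 4
print(cycles_per_iteration(1))  # 6
```

The one-ALU-op case comes out at 6, not the 5 you would get by adding 1 to the 4-cycle fast path. That gap is the extra cycle in question, whichever of the explanations turns out to be right.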
> Or, the slide could just be wrong. There are other errors there, like where it shows the scalar
> mul unit on p5 in SKL, but it is actually on p1 there (and so Anandtech subsequently mentioned
> that the mul unit is moving around based on this incorrect slide). Still it looks deliberate
> to me how they have broken up the RS boxes, not just a quirk of the diagram.
>
> Or, the slide could be right but the separate RS has no practical impact on the load->ALU latency.