By: anon (spam.delete.delete@this.this.spam.com), September 20, 2018 4:35 am
Room: Moderated Discussions
Wilco (Wilco.dijkstra.delete@this.ntlworld.com) on September 20, 2018 2:32 am wrote:
> anon (spam.delete.delete@this.this.spam.com) on September 20, 2018 1:34 am wrote:
> > Travis Downs (travis.downs.delete@this.gmail.com) on September 19, 2018 5:30 pm wrote:
> > > anon (spam.delete.delete@this.this.spam.com) on September 19, 2018 2:45 am wrote:
> > > > > >
> > > > > > Then how do you explain the restriction? What prevents the use of the
> > > > > > fast path with registers that weren't the result of an earlier load?
> > > > >
> > > > > Hardware doesn't move between pipelines. If we assume 4-cycle loads skip the initial complex address
> > > > > calculation stage (and not a later stage), a 4-cycle load after a 5-cycle load must wait for a cycle
> > > > > simply because the pipeline stages it needs are still being used by the earlier load.
> > > > >
> > > > > Wilco
> > > > >
> > > >
> > > > Am I missing something obvious here?
> > > > 4 cycle loads exist.
> > > > What is the restriction that prevents them when the address is the result of an ALU op instead of a load?
> > >
> > > The way I understood it is that if you mix 4 and 5 cycle loads, for example, in a "throughput" scenario,
> > > your 4 cycle loads will often end up taking 5 cycles because
> > > they are out of alignment with the 5 cycle loads
> > > and use the same pipeline stages. In the example, the 4 cycle load can't start in the cycle after a 5 cycle
> > > load because it wants the second part of the load pipeline, which is what the 5 cycle load is using.
> > >
> > > So it turns into a 5 cycle load. It maybe gets even messier
> > > if the skipped pipeline stages are somewhere in the middle.
> > >
> > > We do know that 4 cycle loads do play nice in a throughput scenario if there are only
> > > 4 cycle loads around since 8 concurrent 4-cycle pointer chases do execute at 2 loads
> > > per cycle. Maybe I could add some 5 cycle loads in there and see what happens.
> >
> > Yeah but even in the nice throughput scenario 4 cycle loads didn't happen with the address
> > coming from an ALU, right? So the different latency doesn't seem to be the problem.
>
> The 4-cycle path might only work within the load/store unit because of timing.
> Forwarding within a unit is faster than from a different unit.
>
Ah, that's what you meant. Still, that is completely unrelated to 5 cycle loads blocking the fast path for the following load.
> However it's not clear this is the case, you need sequences like alu->load5->load->alu
> to check whether the load->alu latency can ever be 4 cycles.
>
> Wilco
Honestly I'm not sure that could prove anything. It depends on when in the cycle the forwarding happens. If it's at the start of the cycle, it would be understandable that a far forward plus the fast path runs into timing trouble while an intra-unit forward plus the fast path doesn't. But it could also be that the far forward to the ALU happens in the ALU's own execution cycle, which might have enough slack to make it just fine.