By: Wilco (Wilco.dijkstra.delete@this.ntlworld.com), September 20, 2018 2:32 am
Room: Moderated Discussions
anon (spam.delete.delete@this.this.spam.com) on September 20, 2018 1:34 am wrote:
> Travis Downs (travis.downs.delete@this.gmail.com) on September 19, 2018 5:30 pm wrote:
> > anon (spam.delete.delete@this.this.spam.com) on September 19, 2018 2:45 am wrote:
> > > > >
> > > > > Then how do you explain the restriction? What prevents the use of the
> > > > > fast path with registers that weren't the result of an earlier load?
> > > >
> > > > Hardware doesn't move between pipelines. If we assume 4-cycle loads skip the initial complex address
> > > > calculation stage (and not a later stage), a 4-cycle load after a 5-cycle load must wait for a cycle
> > > > simply because the pipeline stages it needs are still being used by the earlier load.
> > > >
> > > > Wilco
> > > >
> > >
> > > Am I missing something obvious here?
> > > 4 cycle loads exist.
> > > What is the restriction that prevents them when the address is the result of an ALU op instead of a load?
> >
> > The way I understood it is that if you mix 4 and 5 cycle loads, for example, in a "throughput" scenario,
> > your 4 cycle loads will often end up taking 5 cycles because
> > they are out of alignment with the 5 cycle loads
> > and use the same pipeline stages. In the example, the 4 cycle load can't start in the cycle after a 5 cycle
> > load because it wants the second part of the load pipeline which is what the 5 cycle load is using.
> >
> > So it turns into a 5 cycle load. It may get even messier
> > if the skipped pipeline stages are somewhere in the middle.
> >
> > We do know that 4 cycle loads do play nice in a throughput scenario if there are only
> > 4 cycle loads around since 8 concurrent 4-cycle pointer chases do execute at 2 loads
> > per cycle. Maybe I could add some 5 cycle loads in there and see what happens.
>
> Yeah but even in the nice throughput scenario 4 cycle loads didn't happen with the address
> coming from an ALU, right? So the different latency doesn't seem to be the problem.
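
The stage conflict described in the quoted posts can be sketched with a toy model. Everything here is an assumption for illustration: a single load port with five stages, in-order issue, and 4-cycle loads skipping stage 0 (the initial address calculation stage). It is not a description of any real core.

```python
def issue_cycles(loads, stages=5):
    """Toy model of one load pipeline (hypothetical, for illustration only).

    loads: list of (ready_cycle, latency) in program order, latency 4 or 5.
    A 5-cycle load enters at stage 0; a 4-cycle load skips stage 0 and
    enters at stage 1. Each stage holds at most one load per cycle.
    Returns a list of (issue_cycle, effective_latency) per load, where
    effective_latency counts from the cycle the load became ready.
    """
    busy = set()  # (stage, cycle) slots already claimed
    result = []
    for ready, latency in loads:
        first_stage = stages - latency
        start = ready
        # Delay issue until none of the needed stage slots are taken.
        while any((s, start + s - first_stage) in busy
                  for s in range(first_stage, stages)):
            start += 1
        for s in range(first_stage, stages):
            busy.add((s, start + s - first_stage))
        result.append((start, start + latency - ready))
    return result


if __name__ == "__main__":
    # A 4-cycle load that becomes ready one cycle after a 5-cycle load issues.
    print(issue_cycles([(0, 5), (1, 4)]))
    # Two back-to-back 4-cycle loads.
    print(issue_cycles([(0, 4), (1, 4)]))
```

In this model the 4-cycle load behind a 5-cycle load slips by one cycle and so behaves like a 5-cycle load, while back-to-back 4-cycle loads do not interfere with each other.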
The 4-cycle path might only work within the load/store unit because of timing: forwarding within a unit is faster than forwarding from a different unit.
However, it's not clear that this is the case. You need sequences like alu->load5->load->alu to check whether the load->alu latency can ever be 4 cycles.
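
As a rough sketch of how such a measurement would be read, assume the sequence forms a loop-carried dependency chain, the ALU op takes 1 cycle, the first load takes 5 cycles, and the latencies simply add. The function names and default values below are hypothetical, and no measured numbers are implied.

```python
def expected_cycles_per_iteration(second_load_latency,
                                  alu_latency=1, first_load_latency=5):
    """Cycles per iteration of an alu -> load5 -> load -> (next alu) chain,
    assuming the latencies of dependent operations simply add up."""
    return alu_latency + first_load_latency + second_load_latency


def second_load_latency_from(measured_cycles_per_iteration,
                             alu_latency=1, first_load_latency=5):
    """Back out the latency of the second load from a measured chain time,
    under the same additive assumption."""
    return measured_cycles_per_iteration - alu_latency - first_load_latency


if __name__ == "__main__":
    # Second load feeds the ALU in 4 cycles versus 5 cycles.
    print(expected_cycles_per_iteration(4))
    print(expected_cycles_per_iteration(5))
```

Under these assumptions the two cases differ by one cycle per iteration (10 versus 11), which is what a test of this shape would need to resolve.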
Wilco