# You can do two 4-cycle loads per cycle

By: anon (spam.delete.delete@this.this.spam.com), September 20, 2018 4:35 am
Wilco (Wilco.dijkstra.delete@this.ntlworld.com) on September 20, 2018 2:32 am wrote:
> anon (spam.delete.delete@this.this.spam.com) on September 20, 2018 1:34 am wrote:
> > Travis Downs (travis.downs.delete@this.gmail.com) on September 19, 2018 5:30 pm wrote:
> > > anon (spam.delete.delete@this.this.spam.com) on September 19, 2018 2:45 am wrote:
> > > > > >
> > > > > > Then how do you explain the restriction? What prevents the use of the
> > > > > > fast path with registers that weren't the result of an earlier load?
> > > > >
> > > > > Hardware doesn't move between pipelines. If we assume 4-cycle loads skip the initial complex address
> > > > > calculation stage (and not a later stage), a 4-cycle load after a 5-cycle load must wait for a cycle
> > > > > simply because the pipeline stages it needs are still being used by the earlier load.
> > > > >
> > > > > Wilco
> > > > >
> > > >
> > > > Am I missing something obvious here?
> > > > 4 cycle loads exist.
> > > > What is the restriction that prevents them when the address is the result of an ALU op instead of a load?
> > >
> > > The way I understood it is that if you mix 4- and 5-cycle loads, for example in a "throughput" scenario,
> > > your 4-cycle loads will often end up taking 5 cycles because
> > > they are out of alignment with the 5-cycle loads
> > > and use the same pipeline stages. In the example, the 4-cycle load can't start in the cycle after a 5-cycle
> > > load because it wants the second part of the load pipeline, which is what the 5-cycle load is using.
> > >
> > > So it turns into a 5-cycle load. It may get even messier
> > > if the skipped pipeline stages are somewhere in the middle.
> > >
> > > We do know that 4 cycle loads do play nice in a throughput scenario if there are only
> > > 4 cycle loads around since 8 concurrent 4-cycle pointer chases do execute at 2 loads
> > > per cycle. Maybe I could add some 5 cycle loads in there and see what happens.
> >
> > Yeah, but even in the nice throughput scenario, 4-cycle loads didn't happen with the address
> > coming from an ALU, right? So the different latency doesn't seem to be the problem.
>
> The 4-cycle path might only work within the load/store unit because of timing.
> Forwarding within a unit is faster than from a different unit.
>

Ah, that's what you meant. Still, that is completely unrelated to 5-cycle loads blocking the fast path for the following load.

> However, it's not clear this is the case; you need sequences like alu->load5->load->alu
> to check whether the load->alu latency can ever be 4 cycles.
>
> Wilco

Honestly, I'm not sure that could prove anything. It depends on when, and in which cycle, the forwarding happens. If it's at the start of the cycle, then it would be understandable that far forward + fast path could run into timing trouble while intra-unit forward + fast path doesn't. However, it could also mean that the far forward to the ALU happens in the ALU's execution cycle, which might have enough leeway to do it just fine.
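
For reference, the stage-conflict argument quoted above (a 4-cycle load stalling behind a 5-cycle load) can be sketched as a toy model. This is only an illustration under stated assumptions, not a description of the real hardware: it assumes a single load port with a hypothetical 5-stage pipe, and that a 4-cycle load skips the first stage, as in Wilco's "skips the initial complex address calculation stage" scenario. The names `stages_used` and `schedule` are made up for this sketch.

```python
def stages_used(issue_cycle, latency, depth=5):
    """Return the set of (stage, cycle) slots one load occupies.

    Assumption: a load with latency < depth skips the FIRST
    (depth - latency) stages and then moves one stage per cycle.
    """
    first_stage = depth - latency
    return {(stage, issue_cycle + stage - first_stage)
            for stage in range(first_stage, depth)}


def schedule(latencies, depth=5):
    """Issue loads in order on one hypothetical port, at most one per cycle,
    delaying a load while any stage it needs is already occupied."""
    busy = set()
    issue_cycles = []
    cycle = 0
    for latency in latencies:
        # Stall until none of the required (stage, cycle) slots are taken.
        while stages_used(cycle, latency, depth) & busy:
            cycle += 1
        busy |= stages_used(cycle, latency, depth)
        issue_cycles.append(cycle)
        cycle += 1
    return issue_cycles


print(schedule([4, 4, 4]))  # only 4-cycle loads: back to back
print(schedule([5, 4]))     # 4-cycle load behind a 5-cycle load stalls a cycle
```

In this model a stream of only 4-cycle loads (or only 5-cycle loads) issues every cycle, while a 4-cycle load directly behind a 5-cycle load slips by one cycle and so finishes no earlier than a 5-cycle load would have. It says nothing about the ALU-sourced address case, which is the part that remains unexplained.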