By: anon (spam.delete.delete@this.this.spam.com), September 19, 2018 2:45 am
Room: Moderated Discussions
Wilco (Wilco.dijkstra.delete@this.ntlworld.com) on September 18, 2018 2:37 pm wrote:
> anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 1:27 pm wrote:
> > Travis Downs (travis.downs.delete@this.gmail.com) on September 18, 2018 12:29 pm wrote:
> > > anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 11:53 am wrote:
> > > > Travis Downs (travis.downs.delete@this.gmail.com) on September 18, 2018 10:58 am wrote:
> > > > > anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 2:43 am wrote:
> > > > >
> > > > > > Can it do 2 fast path loads in the same cycle? If not it would make sense to prioritize pointer chases.
> > > > >
> > > > > Yes, it can - at least on SKL and IVB (the two archs I tested on).
> > > > ...
> > > > If throughput isn't the problem and it only happens when the loads immediately follow each
> > > > other then it might be something different. Maybe it's skipping the TLB lookup altogether.
> > >
> > > I think it still needs the TLB lookup, and in fact the TLB lookup is still more or less on
> > > the critical path since the addresses here are arbitrary and it needs the tag to select the
> > > right way from the L1D set, whose access happens in parallel. I don't think any type of way
> > > prediction is happening here since anyways the pointer chasing case doesn't lend itself to
> > > it and also the 4-cycle latency is consistent even with "randomly" distributed addresses.
> > >
> >
> > Then how do you explain the restriction? What prevents the use of the
> > fast path with registers that weren't the result of an earlier load?
>
> Hardware doesn't move between pipelines. If we assume 4-cycle loads skip the initial complex address
> calculation stage (and not a later stage), a 4-cycle load after a 5-cycle load must wait for a cycle
> simply because the pipeline stages it needs are still being used by the earlier load.
>
> Wilco
>
Am I missing something obvious here?
4-cycle loads exist.
What is the restriction that prevents them when the address is the result of an ALU op instead of a load?
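
To make the two cases concrete, here is a minimal sketch in C. All names (chain, build_chain, chase_load_to_load, chase_via_alu) and the stride of 167 are made up for illustration and are not taken from anyone's actual test code. The first loop feeds each load's result straight into the next load's address; the second puts an add between the loads. The compiler may fold or rearrange the add, so the generated assembly has to be checked before timing anything.

```c
#include <stddef.h>
#include <stdint.h>

enum { NODES = 1024 };

/* Each slot holds a pointer to another slot in the same array. */
static void *chain[NODES];

/* Link the slots into a single cycle. The stride 167 is an arbitrary odd
   number, so it is coprime with NODES and every slot is visited once. */
static void build_chain(void) {
    for (size_t i = 0; i < NODES; i++)
        chain[i] = &chain[(i + 167) % NODES];
}

/* Load feeds load: the address of each load is the result of the
   previous load (the pointer-chasing case discussed above). */
static void *chase_load_to_load(void *p, size_t steps) {
    while (steps--)
        p = *(void **)p;
    return p;
}

/* Same walk, but the address passes through an ALU op before the next
   load. 'zero' is meant to be 0 at run time so the walk is unchanged.
   Assumption: the compiler emits a real add here; verify in the asm. */
static void *chase_via_alu(void *p, size_t steps, uintptr_t zero) {
    while (steps--) {
        uintptr_t a = (uintptr_t)*(void **)p;
        p = (void *)(a + zero);
    }
    return p;
}
```

Both functions end on the same node for the same step count, so any difference in cycles per step between them (measured with whatever timer you prefer) comes from the dependency chain and not from the access pattern.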