By: Wilco (Wilco.dijkstra.delete@this.ntlworld.com), September 18, 2018 2:37 pm
Room: Moderated Discussions
anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 1:27 pm wrote:
> Travis Downs (travis.downs.delete@this.gmail.com) on September 18, 2018 12:29 pm wrote:
> > anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 11:53 am wrote:
> > > Travis Downs (travis.downs.delete@this.gmail.com) on September 18, 2018 10:58 am wrote:
> > > > anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 2:43 am wrote:
> > > >
> > > > > Can it do 2 fast path loads in the same cycle? If not it would make sense to prioritize pointer chases.
> > > >
> > > > Yes, it can - at least on SKL and IVB (the two archs I tested on).
> > > ...
> > > If throughput isn't the problem and it only happens when the loads immediately follow each
> > > other then it might be something different. Maybe it's skipping the TLB lookup altogether.
> >
> > I think it still needs the TLB lookup, and in fact the TLB lookup is still more or less on
> > the critical path since the addresses here are arbitrary and it needs the tag to select the
> > right way from the L1D set, whose access happens in parallel. I don't think any type of way
> > prediction is happening here since anyways the pointer chasing case doesn't lend itself to
> > it and also the 4-cycle latency is consistent even with "randomly" distributed addresses.
> >
>
> Then how do you explain the restriction? What prevents the use of the
> fast path with registers that weren't the result of an earlier load?
Hardware doesn't move between pipelines. If we assume 4-cycle loads skip the initial complex address-calculation stage (and not a later stage), a 4-cycle load issued right after a 5-cycle load must wait a cycle, simply because the pipeline stages it needs are still in use by the earlier load.
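To make the stage-conflict argument concrete, here is a toy model in Python. It is only a sketch under the assumption stated above: one load port with five stages, where a 5-cycle load enters at stage 0 (the complex address calculation) and a 4-cycle load enters at stage 1. The function name `schedule` and the stage numbering are made up for illustration and are not taken from any documentation of the real hardware.

```python
FULL_LATENCY = 5  # assumed depth of the normal load path (stages 0..4)


def schedule(loads):
    """Toy model of one load port.

    loads: list of (ready_cycle, latency) in issue order, latency 4 or 5.
    A 5-cycle load occupies stages 0..4; a 4-cycle load skips stage 0
    and occupies stages 1..4. Each stage holds one load per cycle.
    Returns a list of (issue_cycle, done_cycle).
    """
    busy = set()  # (stage, cycle) slots that are already taken
    result = []
    for ready, latency in loads:
        first_stage = FULL_LATENCY - latency  # 0 for 5-cycle, 1 for 4-cycle
        issue = ready
        # Delay the load until none of the stage slots it needs is occupied.
        while any((first_stage + i, issue + i) in busy for i in range(latency)):
            issue += 1
        for i in range(latency):
            busy.add((first_stage + i, issue + i))
        result.append((issue, issue + latency))
    return result


if __name__ == "__main__":
    # 5-cycle load at cycle 0, then a 4-cycle load that is ready at cycle 1.
    print(schedule([(0, 5), (1, 4)]))
    # Two back-to-back 4-cycle loads for comparison.
    print(schedule([(0, 4), (1, 4)]))
```

In this model the 4-cycle load that is ready one cycle after a 5-cycle load cannot issue until the following cycle, because stage 1 is still held by the earlier load, so it completes no sooner than a normal 5-cycle load would have. Back-to-back loads of the same kind never collide.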
Wilco