By: anon (spam.delete.delete@this.this.spam.com), September 19, 2018 2:42 am
Room: Moderated Discussions
Travis Downs (travis.downs.delete@this.gmail.com) on September 18, 2018 3:08 pm wrote:
> anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 1:27 pm wrote:
>
> > Then how do you explain the restriction? What prevents the use of the
> > fast path with registers that weren't the result of an earlier load?
>
> I don't know.
>
> One theory would be that it is a true restriction, i.e., the hardware can't easily support the fast
> path, e.g., because there is a dedicated path for loads to feed directly back into the load EU, or
> the timing otherwise doesn't work out if an ALU is involved. That's what I thought originally.
>
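For reference, the case the fast path is aimed at is the plain dependent-load chain, i.e. roughly something like this (just a sketch to keep the discussion concrete; the names are made up):

struct node { struct node *next; long pad[7]; };

/* Pure pointer chase: each load's base register comes straight from the
   previous load and the addressing is simple [reg], so every load in the
   chain is a candidate for the 4-cycle path. */
struct node *chase(struct node *p, long iters)
{
    for (long i = 0; i < iters; i++)
        p = p->next;            /* roughly: mov rax, [rax] */
    return p;                   /* keep the final pointer live */
}
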
> The other theory would be that it's entirely possible to implement the fast path in general, at least for
> simple addressing, but that it just doesn't pay off outside of pure pointer-chasing: the payoff is less
> direct, the latency penalties are still there, and replays have a higher cost, in that if you're trying
> to get 2 loads per cycle a replay definitely steals the EU for the second try at the load.
>
Or the displacement vs. page-crossing statistics are vastly different depending on where the base register comes from.
Have you tested disp8 vs disp32 for ALU sources?
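Something along these lines is what I have in mind (hypothetical sketch in GCC inline asm, nothing I've actually measured; it assumes a buffer set up as a cycle of pointers with the next pointer stored at offset OFF):

#include <stdint.h>

/* The base register passes through one ALU op before feeding the next load
   (add $0 is not a dependence-breaking idiom, so the chain is preserved).
   OFF = 16 encodes as disp8; OFF = 2032 forces disp32 while staying under
   the 2048-byte displacement limit usually quoted for the 4-cycle path.
   Compare cycles/iteration against the plain mov rax, [rax+OFF] chase. */
#define OFF 16

void *chase_through_alu(void *p, uint64_t iters)
{
    for (uint64_t i = 0; i < iters; i++) {
        asm volatile(
            "add  $0, %[p]             \n\t"   /* ALU op produces the base  */
            "mov  %c[off](%[p]), %[p]  \n\t"   /* load with disp8 or disp32 */
            : [p] "+r" (p)
            : [off] "i" (OFF)
            : "memory");
    }
    return p;
}

If the ALU-fed version runs at the plain chase's cycles/iteration plus one for the add, the fast path isn't really restricted to load-fed bases; if it's a cycle slower than that, it is.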
> Also, as Wilco points out, once you start mixing latencies things become trickier, like both simple and
> complex addressing in the same loop: things might not work out well due to conflicts in the pipeline,
> including inside the load EUs and then for writeback/bypass etc. So if the opposite of "pointer chasing"
> is "load throughput" then maybe you are better off sticking with 5-cycle loads all the time.
>
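To put a concrete shape on the mixed case: a loop like the one below has one load with simple addressing (a 4-cycle candidate) and one with an indexed mode (5 cycles) in the same dependent chain, so the two results would want the writeback/bypass paths on different schedules. Again just an illustrative sketch, with the buffer layout assumed:

#include <stdint.h>

/* One simple-addressing load ([p + 8]) and one indexed load ([p + idx*8])
   back to back in the same chain. If the first got the 4-cycle path and the
   second the 5-cycle path, their results complete on different schedules. */
void *chase_mixed(void *p, uint64_t iters)
{
    uint64_t idx = 0;
    for (uint64_t i = 0; i < iters; i++) {
        asm volatile(
            "mov  8(%[p]), %[p]          \n\t"   /* simple:  base + disp8   */
            "mov  (%[p],%[idx],8), %[p]  \n\t"   /* complex: base + index*8 */
            : [p] "+r" (p)
            : [idx] "r" (idx)
            : "memory");
    }
    return p;
}
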
> I was partial to the first theory, but I think I'm being convinced the second is more likely.
>
> It's clear that it's possible to do 4 cycles unconditionally, at least with simple addressing, with
> an L1 cache in more or less the Intel style, since Ryzen does it and its cache is fairly similar (at least
> from the outside). The Intel L1D is a bit more capable in terms of misaligned loads and 256-bit loads though,
> so maybe that's where the trade-off lies, or maybe mixing different latencies really does suck.