By: Travis Downs (travis.downs.delete@this.gmail.com), September 18, 2018 3:08 pm
Room: Moderated Discussions
anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 1:27 pm wrote:
> Then how do you explain the restriction? What prevents the use of the
> fast path with registers that weren't the result of an earlier load?
I don't know.
One theory would be that it is a true restriction, i.e., the hardware can't easily support the fast path in the general case, e.g., because there is a dedicated path for load results to feed directly back into the load EU, or because the timing otherwise doesn't work out when an ALU is involved. That's what I thought originally.
The other theory would be that it's entirely possible to implement the fast path in general, at least for simple addressing, but that it simply doesn't pay off outside of pure pointer chasing: the payoff is less direct, the latency penalties are still there, and replays are more costly, in that if you're trying to sustain 2 loads per cycle, a replay definitely steals a load EU slot for the second try at the load.
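To make the contrast concrete, here's roughly what I mean by the two regimes (just a sketch in C, not anyone's real benchmark):

    /* Latency-bound "pointer chasing": each load's address is the previous
     * load's result, so a 4- vs 5-cycle load latency shows up 1:1 in runtime,
     * and if a load has to replay, the load EU was idle waiting on the chain
     * anyway. */
    void *chase(void *p, long n) {
        while (n--)
            p = *(void **)p;        /* mov rax, [rax]: simple addressing, load feeds load */
        return p;
    }

    /* Throughput-bound loads: addresses don't depend on earlier loads, so the
     * OoO machinery mostly hides an extra cycle of latency, while a replay
     * would eat one of the two load slots in that cycle. */
    long sum(const long *a, long n) {
        long s = 0;
        for (long i = 0; i < n; i++)
            s += a[i];
        return s;
    }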
Also, as Wilco points out, once you start mixing latencies (e.g., both simple and complex addressing in the same loop), things get trickier: you can run into conflicts in the pipeline, both inside the load EUs and later at writeback/bypass, etc. So if the opposite of "pointer chasing" is "load throughput", then maybe you are better off sticking with 5-cycle loads all the time.
I was partial to the first theory, but I'm becoming convinced the second is more likely.
It's clear that it's possible to do 4 cycles unconditionally, at least with simple addressing, with an L1 cache in more or less the Intel style, since Ryzen does it and its cache is fairly similar (at least from the outside). The Intel L1D is a bit more capable in terms of misaligned loads and 256-bit loads though, so maybe that's where the trade-off lies, or maybe mixing different latencies really does suck.
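If anyone wants to poke at this themselves, here is roughly the kind of probe I mean (a sketch only: inline asm pins down the two addressing forms, rdtsc counts reference cycles rather than core cycles, so treat the absolute numbers as relative unless the machine is pinned near nominal clock):

    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>

    #define ITERS 100000000UL

    int main(void) {
        void *node = &node;          /* self-pointing cell: the chase never leaves L1 */
        void *p = node;
        uint64_t zero = 0;

        uint64_t t0 = __rdtsc();
        for (uint64_t i = 0; i < ITERS; i++)
            asm volatile("mov (%0), %0" : "+r"(p));               /* base-only addressing */
        uint64_t t1 = __rdtsc();
        for (uint64_t i = 0; i < ITERS; i++)
            asm volatile("mov (%0,%1,1), %0" : "+r"(p) : "r"(zero)); /* base + index */
        uint64_t t2 = __rdtsc();

        printf("simple : %.2f tsc/iter\n", (double)(t1 - t0) / ITERS);
        printf("indexed: %.2f tsc/iter\n", (double)(t2 - t1) / ITERS);
        printf("p=%p\n", p);         /* keep the chain from being optimized away */
        return 0;
    }

On the Intel parts being discussed I'd expect the simple form to come out around 4 per iteration and the indexed form around 5; my understanding is that on Ryzen both land around 4, but I haven't tried this exact toy on one, so take that as an assumption to verify rather than a result.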