By: anon (spam.delete.delete@this.this.spam.com), September 18, 2018 10:53 am
Room: Moderated Discussions
Travis Downs (travis.downs.delete@this.gmail.com) on September 18, 2018 9:39 am wrote:
> anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 2:43 am wrote:
>
> > Can it do 2 fast path loads in the same cycle? If not it would make sense to prioritize pointer chases.
>
> Good question, I'll test it.
>
> > It could also be bypass delay.
>
> Yes, perhaps there is an extra cycle of delay for a result coming from the ALU and consumed by the ALU - but
> it doesn't generally show up: that is, except for this special 4-cycle case, the latencies are all the same
> whether the dependency chains pass through the ALU or not (in particular, a pointer chase with a complex addressing
> mode takes the expected number of cycles if you add an ALU op: 5 + 1 = 6 for a latency 1 operation).
>
> > Not sure what the likelihood of a store to a calculated address being reloaded
> > immediately is but I can't imagine it would be very high.
> > Similarly for anything but a pointer chase the extra
> > cycle for a load shouldn't matter when you've got other dependency chains that can execute in parallel.
>
> I didn't follow what you meant by "store to a calculated address". Note that there
> are no stores to memory here, but perhaps you mean a different type of store.
>
I meant that in all other scenarios, including stores, the latency should matter much less.
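To make the distinction concrete, here's a minimal sketch of the two cases in C (my illustration, not the actual test harness):

/* Latency-bound: each load address depends on the previous load's result,
   so every cycle of load latency shows up directly in the runtime. */
void *chase(void *head, long n) {
    void *p = head;
    for (long i = 0; i < n; i++)
        p = *(void **)p;   /* pointer chase: one full load latency per iteration */
    return p;
}

/* Mostly throughput-bound: the loads are independent, so the core overlaps
   them and an extra cycle of load latency is largely hidden. */
long sum_array(const long *a, long n) {
    long s = 0;
    for (long i = 0; i < n; i++)
        s += a[i];         /* independent loads, pipelined */
    return s;
}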
> About the latency not mattering other than pointer chasing, I think it is kind of true, kind of not.
> We like to think of stuff in a binary way like "on the critical path, latency matters 100% and
> off the critical path, latency doesn't matter at all", but in real-world code, especially outside
> of small high-trip-count loops, the situation I guess could best be described as "complex".
>
Everything is complex once you go into details. We can only work with simplifications here.
> Adding latency to a load, even if not in a pointer chasing scenario or part of a long dep chain
> still makes dependent operations available for execution later, which sometimes probably reduces
> ILP slightly, so in general you probably have some partial dependence on load latency even outside
> of pointer chasing, which is heavily dependent on the specific code. Maybe you see an average runtime
> impact of 0.2 cycles for every 1 cycle you increase the latency or something like that.
>
> > Like Peter mentioned it also only attempts this for displacements less than 2048,
> > so they probably did the math that this would get them a speedup on average.
> > If you assumed the worst case of 2047 displacement and random addresses, half the accesses would fall
> > within the same page so at twice the latency for a replay it would be neutral. With the average
> > displacement between 0 and 2047 usually being far lower than 2047 it should easily be a win.
>
> I think something is off in the math there - yes, it's 2x the latency for a replay, but it's
> not zero latency for a successful fast path: you shave off only a cycle. So compared to the
> base case of 5 cycles, you are looking at a 4 (Haswell) or 5 (Skylake) cycle penalty versus
> a 1 cycle benefit. So the crossover point for Skylake is at 6 successful fast paths for every
> replay (ignoring the extra cost of the replay in terms of executed uops).
>
Yeah I cut out too much in an effort to condense it.
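Spelling out the arithmetic I cut, as a little C sketch (my own reconstruction; the penalty and benefit numbers are from your post):

#include <stdio.h>

/* Sketch: break-even failure rate for the 4-cycle fast path.
   benefit = 1 cycle saved on a successful fast path;
   penalty = extra cycles on a replay (4 on Haswell, 5 on Skylake).
   Break-even when f*penalty == (1-f)*benefit, i.e. f = 1/(penalty+1). */
int main(void) {
    int penalty[2] = {4, 5};  /* Haswell, Skylake */
    for (int i = 0; i < 2; i++)
        printf("penalty %d: break-even failure rate = 1/%d = %.1f%%\n",
               penalty[i], penalty[i] + 1, 100.0 / (penalty[i] + 1));
    return 0;  /* prints 20.0% (Haswell) and 16.7% (Skylake) */
}

So roughly one replay per five (Haswell) or six (Skylake) fast-path attempts is the break-even point.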
> That makes the choice of 2048 interesting, compared to say 1024. For offsets between 1024
> and 2048, using the "uniformly random accesses" assumption you wouldn't expect the fast path
> to pay off, on average. I have no doubt the number was carefully chosen however, probably
> through simulation - so there might be some hidden effect that makes page crossing less common
> than you'd expect - i.e., the "uniformly random" assumption may be wrong somehow.
>
Correct; if I'm not mistaken, for uniformly random accesses the chance of a page crossing would be ~37%. If the displacements or memory positions are even slightly biased towards lower numbers, you can easily get it below the ~20% (Haswell) / ~16.7% (Skylake) needed to make it worthwhile, especially if you halve the penalty in most cases where it does happen, like SKL does.
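If anyone wants to poke at the distribution assumptions, here's a quick Monte Carlo sketch (mine, and purely illustrative: 4 KiB pages, base page offset uniform, two example displacement distributions; the percentage you get depends entirely on what you assume):

#include <stdio.h>
#include <stdlib.h>

/* Estimate how often base+disp leaves the base register's 4 KiB page,
   with the base's page offset uniform in [0, 4096) and the displacement
   uniform in [0, max_disp). Both distributions are assumptions. */
static double crossing_rate(int max_disp, long trials) {
    long crossings = 0;
    for (long i = 0; i < trials; i++) {
        int base = rand() % 4096;
        int disp = rand() % max_disp;
        if (base + disp >= 4096)
            crossings++;
    }
    return 100.0 * crossings / trials;
}

int main(void) {
    printf("disp uniform in [0,2048): %.1f%% crossings\n", crossing_rate(2048, 10000000));
    printf("disp uniform in [0,256):  %.1f%% crossings\n", crossing_rate(256, 10000000));
    return 0;
}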
> >
> > > I checked and AMD Ryzen doesn't have either of these weird behaviors: simple loads
> > > are 4 cycles even if the address comes from an ALU operation, and they are still
> > > 4 cycles when the base register and full address fall in different pages.
> > >
> > > The penalty on Skylake is 10 cycles, but in a tight loop made
> > > to trigger the issue on every load you get only 7.5
> > > cycles average: it seems that when the speculation failure
> > > and replay happens, the next load will be a normal
> > > 5-cycle load, but the one after that will speculate again
> > > and try to execute in 4 cycles, so you end up with
> > > alternating latencies of 10,5,10,5 which average out to 7.5. So the worst case is less bad in Skylake.
> >
> > All in one dependency chain? Again, Peter wrote that Skylake won't attempt the fast path if
> > it failed for the previous load. There's probably statistics behind this choice as well.
>
> Yes, the tests were in one dependency chain, although I'm not sure exactly how it works in practice
> (e.g., is the result itself tagged with the "replayed" tag or does the EU track it or something else).
> This was new for Skylake over Haswell, so maybe they wanted to partly remove the glass jaw that occurs
> if for some reason you got an interaction with an allocator such that many offsets cross a page. The
> slow-path loads increased from 9 to 10 cycles on Skylake so maybe they wanted to avoid any apparent
> regression in the worst case. Or perhaps cause and effect are reversed: maybe the extra one cycle in
> the replay case is actually used to implement the use-default-path-if-fast-path-failed behavior.
>
> About statistics, sure - probably to decide whether to implement this at all. About how many
> loads to force into the 5-cycle path if a replay occurs, I'm not convinced that the value of
> 1 was chosen through stats: it's probably just an easy way to implement this, saving one bit
> of state. If you wanted to be really sophisticated you could use a simple IP-based predictor
> that learns which loads tend to have page-crossing behavior - but we're really talking about
> a small fraction of a fraction of a cycle here: almost certainly not worth the complication.
>
Oh, I wasn't talking about anything complicated like that. After all, just checking whether the high bits are 0 was chosen because it's easy to implement, not because 2047 was the exact cutoff point. So just statistics as in "is adding an extra cycle to block once after a failure a net win", or, if the added cycle exists for other reasons, "is the cheapest method enough to fix the regression".
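To be concrete about the high-bits check, this is the shape of the condition I have in mind (a guess at the logic, not Intel's actual implementation):

#include <stdbool.h>
#include <stdint.h>

/* Guess: the fast path is attempted only when the displacement's high bits
   are all zero, i.e. 0 <= disp <= 2047, so base+disp usually stays within
   the base register's page. Cheap to check, hence the 2048 cutoff. */
static bool fast_path_eligible(int64_t disp) {
    return (disp >> 11) == 0;  /* true exactly for disp in [0, 2047] */
}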