By: Travis Downs (travis.downs.delete@this.gmail.com), September 18, 2018 9:39 am

Room: Moderated Discussions

anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 2:43 am wrote:

> Can it do 2 fast path loads in the same cycle? If not it would make sense to prioritize pointer chases.

Good question, I'll test it.
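
In the meantime, here is roughly the shape of the test I have in mind (an untested sketch, not finished code): enough independent fast-path-eligible pointer chases to cover 4 cycles of latency times two load ports, then see whether throughput reaches two loads per cycle. Note that rdtsc counts reference cycles, not core clocks, so a real run should use the core cycle counter via perf instead.

    #include <stdint.h>
    #include <stdio.h>

    #define ITERS  100000000UL
    #define CHAINS 8  /* 4-cycle latency x 2 load ports, if two are possible */

    static inline uint64_t rdtsc(void) {
        uint64_t lo, hi;
        asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return (hi << 32) | lo;
    }

    int main(void) {
        /* self-pointing cells: each load returns its own address, so each
           chain is a dependent series of simple [reg] loads, all eligible
           for the 4-cycle fast path */
        void *cell[CHAINS];
        for (int i = 0; i < CHAINS; i++) cell[i] = &cell[i];
        asm volatile("" ::: "memory");  /* make sure the stores happen */

        void *c0 = cell[0], *c1 = cell[1], *c2 = cell[2], *c3 = cell[3];
        void *c4 = cell[4], *c5 = cell[5], *c6 = cell[6], *c7 = cell[7];

        uint64_t start = rdtsc();
        for (uint64_t i = 0; i < ITERS; i++) {
            /* with 8 chains in flight, ~0.5 cycles/load means two fast-path
               loads per cycle; ~1.0 cycles/load means only one */
            asm volatile(
                "mov (%0), %0 \n\t mov (%1), %1 \n\t"
                "mov (%2), %2 \n\t mov (%3), %3 \n\t"
                "mov (%4), %4 \n\t mov (%5), %5 \n\t"
                "mov (%6), %6 \n\t mov (%7), %7"
                : "+r"(c0), "+r"(c1), "+r"(c2), "+r"(c3),
                  "+r"(c4), "+r"(c5), "+r"(c6), "+r"(c7));
        }
        printf("~cycles per load: %.2f\n",
               (double)(rdtsc() - start) / (ITERS * CHAINS));
        return 0;
    }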

> It could also be bypass delay.

Yes, perhaps there is an extra cycle of delay for a result coming from the ALU and consumed by the ALU - but it doesn't *generally* show up: that is, except for this special 4-cycle case, the latencies are all the same whether the dependency chains pass through the ALU or not (in particular, a pointer chase with a complex addressing mode takes the expected number of cycles if you add an ALU op: 5 + 1 = 6 for a latency 1 operation).
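
Concretely, here is a minimal sketch (my illustration, untested) of the kind of chain that shows the 5 + 1 = 6 behavior: a complex-addressing-mode pointer chase with one ALU op spliced into the dependency chain. An extra ALU-to-load bypass penalty would show up here as 7 cycles per iteration instead of 6.

    #include <stdint.h>

    /* Chase pointers through a complex addressing mode (base + index),
       with a 1-cycle ALU op feeding each load's base register. Pass a
       self-pointing cell (*cell == &cell) and time iters iterations. */
    void chase_with_alu(void **cell, uint64_t iters) {
        void *p = *cell;
        uint64_t zero = 0;
        for (uint64_t i = 0; i < iters; i++) {
            asm volatile(
                "mov (%0,%1,1), %0 \n\t"  /* complex address: 5-cycle load */
                "add %1, %0"              /* adds 0: 1-cycle ALU dependency */
                : "+r"(p) : "r"(zero));
        }
        *cell = p;  /* keep the result live */
    }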

> Not sure what the likelihood of a store to a calculated address being reloaded
> immediately is but I can't imagine it would be very high. Similarly for anything but a pointer chase the extra
> cycle for a load shouldn't matter when you've got other dependency chains that can execute in parallel.

I didn't follow what you meant by "store to a calculated address". Note that there are no stores to memory here, but perhaps you mean a different type of store.

About the latency not mattering outside of pointer chasing, I think it is kind of true, kind of not. We like to think of this stuff in a binary way, like "on the critical path, latency matters 100%; off the critical path, latency doesn't matter at all", but in real-world code, especially outside of small, high-trip-count loops, the situation could best be described as "complex".

Adding latency to a load, even one that is not part of a pointer chase or a long dep chain, still makes dependent operations available for execution later, which sometimes reduces ILP slightly. So in general you probably have some partial dependence on load latency even outside of pointer chasing, heavily dependent on the specific code. Maybe you see an average runtime impact of 0.2 cycles for every 1 cycle you add to the load latency, or something like that.

> Like Peter mentioned it also only attempts this for displacements less than 2048,
> so they probably did the math that this would get them a speedup on average.
> If you assumed the worst case of 2047 displacement and random addresses half the accesses would fall
> within the same page so at twice the latency for a replay it would be neutral. With the average
> displacement between 0 and 2047 usually being far lower than 2047 it should easily be a win.

I think something is off in the math there - yes, it's 2x the latency for a replay, but it's not zero latency for a successful fast path: you shave off only a cycle. So compared to the base case of 5 cycles, you are looking at a 4 (Haswell) or 5 (Skylake) cycle penalty versus a 1 cycle benefit. So the break-even point for Skylake is at 5 successful fast paths for every replay, i.e., one replay in every 6 loads (ignoring the extra cost of the replay in terms of executed uops).
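
To spell that arithmetic out: if p is the fraction of loads that end up replaying, the expected Skylake latency with the fast path is (1 - p) * 4 + p * 10 = 4 + 6p cycles, versus a flat 5 cycles without it, so the two schemes are equal exactly at p = 1/6.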

That makes the choice of 2048 interesting, compared to, say, 1024. For offsets between 1024 and 2048, using the "uniformly random accesses" assumption, you wouldn't expect the fast path to pay off on average. I have no doubt the number was carefully chosen, however, probably through simulation - so there might be some hidden effect that makes page crossing less common than you'd expect - i.e., the "uniformly random" assumption may be wrong somehow.
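
For what it's worth, under the uniformly random assumption a displacement of d bytes crosses a 4K page with probability roughly d / 4096 (ignoring the access width), so the break-even point p = 1/6 corresponds to d of about 683 bytes. Displacements in the 1024-2047 range would cross at least a quarter of the time and lose on average, which is why the 2048 cutoff looks generous under that assumption.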

> > I checked and AMD Ryzen doesn't have either of these weird behaviors: simple loads
> > are 4 cycles even if the address comes from an ALU operation, and they are still
> > 4 cycles when the base register and full address fall in different pages.
> >
> > The penalty on Skylake is 10 cycles, but in a tight loop made
> > to trigger the issue on every load you get only 7.5
> > cycles average: it seems that when the speculation failure
> > and replay happens, the next load will be a normal
> > 5-cycle load, but the one after that will speculate again
> > and try to execute in 4 cycles, so you end up with
> > alternating latencies of 10,5,10,5 which average out to 7.5. So the worst case is less bad in Skylake.
>
> All in one dependency chain? Again, Peter wrote that Skylake won't attempt the fast path if
> it failed for the previous load. There's probably statistics behind this choice as well.

Yes, the tests were in one dependency chain, although I'm not sure exactly how it works in practice (e.g., is the result itself tagged with a "replayed" bit, or does the EU track it, or something else?). This was new for Skylake over Haswell, so maybe they wanted to partly remove the glass jaw that occurs if, for some reason, you get an interaction with an allocator such that many offsets cross a page. The slow-path loads increased from 9 to 10 cycles on Skylake, so maybe they wanted to avoid any apparent regression in the worst case. Or perhaps cause and effect are reversed: maybe the extra cycle in the replay case is actually used to implement the use-default-path-if-fast-path-failed behavior.
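
For reference, this is the sort of setup that triggers the replay on every load (a sketch of the approach, with illustrative layout constants, not my actual test code): each pointer sits in the last qword of a page, and the small displacement pushes every access into the next page.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define PAGES    1024
    #define DISP     16    /* < 2048, so the fast path is attempted */
    #define BASE_OFF 4088  /* base points at the last qword of a page, so
                              base + DISP always lands in the next page */

    int main(void) {
        char *buf = aligned_alloc(4096, (PAGES + 1) * 4096UL);
        if (!buf) return 1;
        /* the cell at [base + DISP] holds the base pointer for the next
           page (wrapping), so every load crosses a page boundary */
        for (int i = 0; i < PAGES; i++) {
            char *base = buf + i * 4096UL + BASE_OFF;
            char *next = buf + ((i + 1) % PAGES) * 4096UL + BASE_OFF;
            *(char **)(base + DISP) = next;
        }
        char *p = buf + BASE_OFF;
        for (long i = 0; i < 100000000L; i++)
            p = *(char **)(p + DISP);  /* compiles to: mov rax, [rax + 16] */
        /* time the loop with perf: per the numbers above you'd expect an
           average near 7.5 cycles/iteration on Skylake (alternating 10, 5)
           and a flat ~9 on Haswell, which doesn't alternate */
        printf("%p\n", (void *)p);  /* keep the chain alive */
        return 0;
    }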

About statistics, sure - probably to decide whether to implement this at all. As for how many loads to force onto the 5-cycle path after a replay occurs, I'm not convinced the value of 1 was chosen through stats: it's probably just the easiest way to implement this, saving one bit of state. If you wanted to be really sophisticated you could use a simple IP-based predictor that learns which loads tend to have page-crossing behavior - but we're really talking about a small fraction of a fraction of a cycle here: almost certainly not worth the complication.
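
For illustration only (nothing like this is documented anywhere; it's just what "IP-based predictor" could mean here): a small table of 2-bit saturating counters indexed by a hash of the load's IP, trained on whether each load crossed a page.

    #include <stdbool.h>
    #include <stdint.h>

    /* 2-bit saturating counters indexed by a hash of the load IP; a load
       predicted to cross a page skips the 4-cycle fast path. */
    #define TABLE_BITS 6
    static uint8_t cross_ctr[1 << TABLE_BITS];

    static unsigned idx(uint64_t load_ip) {
        return (load_ip >> 2) & ((1u << TABLE_BITS) - 1);  /* drop low bits */
    }

    /* consult before issuing the load */
    bool try_fast_path(uint64_t load_ip) {
        return cross_ctr[idx(load_ip)] < 2;  /* weakly "doesn't cross" */
    }

    /* update once the load's address has resolved */
    void train(uint64_t load_ip, bool crossed_page) {
        uint8_t *c = &cross_ctr[idx(load_ip)];
        if (crossed_page) { if (*c < 3) (*c)++; }
        else              { if (*c > 0) (*c)--; }
    }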
