By: Travis Downs (travis.downs.delete@this.gmail.com), September 18, 2018 11:07 am
Room: Moderated Discussions
anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 10:53 am wrote:
>
> I meant that in all other scenarios, including stores, the latency should matter much less.
Well yes, for stores "latency" matters very little - to the point where it is hard to even define or test the "latency" for stores.
The best you can do is measure the store-forwarding latency, which on Skylake at least is, in the best case, even lower than an L1 hit, at 3 cycles: probably because you can take the address calculation and TLB lookup out of the fast path entirely: you just assume the load hit the store based on the low address bits, and take a disambiguation speculation failure and flush if you are wrong.
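Something like this toy dependency chain is the kind of test I have in mind (just a sketch, my own naming; you'd want to pin the frequency, warm up, and subtract the 1-cycle add to get a clean number): store a value, load it straight back, and feed the result into the next iteration's store, so the loop's latency is dominated by store-to-load forwarding.

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc, gcc/clang on x86 */

/* Toy store-forwarding latency probe (sketch only): each iteration
 * stores a value and immediately loads it back, and the loaded value
 * feeds the next iteration's store, so the iterations form a serial
 * dependency chain through store-to-load forwarding (plus one add). */
int main(void) {
    volatile uint64_t slot;                 /* the forwarded location */
    uint64_t v = 1;
    const uint64_t iters = 100000000;

    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < iters; i++) {
        slot = v;                           /* store                  */
        v = slot + 1;                       /* dependent load + add   */
    }
    uint64_t end = __rdtsc();

    /* rdtsc counts reference cycles, so pin or convert the frequency
     * if you want core cycles.                                       */
    printf("~%.2f ref cycles/iter (v=%llu)\n",
           (double)(end - start) / iters, (unsigned long long)v);
    return 0;
}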
> Everything is complex once you go into details. We can only work with simplifications here.
Sure, yes - but we should acknowledge the complexity, or at least when simplifications are being made. So rather than saying "load latency doesn't matter except for pointer-chasing cases" we could say something like "load latency matters less in non-pointer-chasing scenarios, but how much depends on the code".
>
> > That makes the choice of 2048 interesting, compared to say 1024. For offsets between 1024
> > and 2048, using the "uniformly random accesses" assumption you wouldn't expect the fast path
> > to pay off, on average. I have no doubt the number was carefully chosen however, probably
> > through simulation - so there might be some hidden effect that makes page crossing less common
> > than you'd expect - i.e., the "uniformly random" assumption may be wrong somehow.
> >
>
> Correct, if I'm not mistaken for uniformly random the chance of page crossings would
> be ~37%. If the displacements or memory positions are even slightly biased towards lower
> numbers you can easily get it below the 20%/~16.6% needed to make it worthwhile, especially
> if you halve the penalty on most cases where it does happen like SKL does.
Note that the penalty is not "halved" on Skylake except in the pathological case where every load has the different-page behavior. In the "usual" case, where only an occasional load behaves that way, the new Skylake behavior slightly increases the penalty: you get a 6-cycle penalty (10 vs 4) for the naughty load, plus an additional ~1-cycle penalty on the next load, which takes the 5-cycle path rather than the 4-cycle path even though it probably could have taken the 4-cycle path.
So for the case we're talking about, with 20% naughty loads or whatever, we can't say the penalty is halved on Skylake (it is probably worse than on Haswell).
ISTM the layout would have to be quite biased to get down to the crossover point.
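To put a rough number on that crossover (my own back-of-the-envelope using the figures above, ignoring correlations between consecutive loads): the speculation saves 1 cycle (4 vs 5) when it works, and costs about 5 + 1 = 6 cycles versus the plain 5-cycle path when it doesn't, so break-even is around

p * 6 = (1 - p) * 1  =>  p = 1/7 ~= 14%

of loads crossing - i.e., a bit lower (harder to hit) than the 16.6%/20% figures above.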
> Oh I wasn't talking about anything complicated like that. After all just checking if the high bits
> are 0 was chosen because it's easy to implement, not because 2047 was the exact cutoff point. So
> just statistics as in "is adding an extra cycle to block once after a failure a net win" or if the
> added cycle exists for other reasons "is the cheapest method enough to fix the regression".
Agreed.
As for 2048 vs other values, one would expect they at least tried all the easy-to-check power-of-two values, e.g., via simulation.
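Just to illustrate what such a sweep could look like (a toy model only: uniformly random page offsets and displacements, plus the 4/5/10(+1) cycle figures from above - real displacement distributions are heavily skewed toward small values, which is presumably the whole point):

#include <stdio.h>

/* Toy sweep of the displacement cutoff (illustrative only).
 * Assumptions: 4 KiB pages, base page offset uniform over [0,4095],
 * displacement uniform over [0,4095], fast path = 4 cycles, normal
 * path = 5 cycles, failed speculation = 10 cycles plus ~1 extra cycle
 * on the following load (the figures discussed above).              */
int main(void) {
    for (int cutoff = 64; cutoff <= 4096; cutoff *= 2) {
        double total = 0.0;
        for (int d = 0; d < 4096; d++) {
            if (d >= cutoff) {
                total += 5.0;                      /* never speculates          */
            } else {
                double p_cross = d / 4096.0;       /* chance base+d leaves page */
                total += (1.0 - p_cross) * 4.0     /* fast path succeeds        */
                       + p_cross * (10.0 + 1.0);   /* replay + next-load cost   */
            }
        }
        printf("cutoff %4d: %.3f expected cycles/load\n", cutoff, total / 4096.0);
    }
    return 0;
}

Under those (unrealistic) assumptions the sweep bottoms out around a cutoff of 512, which is just the earlier point restated; the interesting question is what a real displacement histogram does to it.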