By: anon (spam.delete.delete@this.this.spam.com), September 19, 2018 2:40 am
Room: Moderated Discussions
Travis Downs (travis.downs.delete@this.gmail.com) on September 18, 2018 1:52 pm wrote:
> anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 11:51 am wrote:
> >
> > This was about the subset where you've got an ALU op before the store. Like I said I can't imagine
> > many cases where you'd need such a complex address calculation that the normal addressing modes don't
> > cut it and then hit the same address immediately afterwards such that the bypass latency from ALU
> > to AGU would matter. Of course it's not bypass delay, but that was my idea at the time.
>
> Right, but I'm not sure why we are talking about stores here and combining
> "address comes from ALU" with "stores". If you're saying that cases that involve
> stores probably aren't usually sensitive to store latency, then sure.
>
Read the last sentence. This was just about whether or not Intel could afford having a bypass delay for addresses to the AGUs. For stores I don't expect the latency to be part of the critical path, and pointer chases can use the fast path.
> What I'm saying is that the interesting cases are: pointer chasing (no stores) without
> an ALU op and pointer chasing (no stores) that have an ALU op in the dependency chain.
>
> Although no ALU op is probably more common in that case (especially on x86 with relatively powerful addressing
> modes), I don't think the latter case is particularly rare: ALU ops show up in the address calculation
> path in real code when using tagged or compressed pointers, when using indexes into a structure where
> indexed addressing modes don't cut it (structure size not a power of two or greater than 8), in cases
> where the pointer is selected based on a conditional move or similar mechanism, etc.
>
> > That's why I wrote "Similarly for anything but a pointer chase the extra cycle for a load shouldn't matter
> > when you've got other dependency chains that can execute in parallel." and not "latency doesn't matter".
> > It's literally "shouldn't matter for certain code".
>
> OK, I parsed it differently because of where you said for anything but a pointer chase, so I read it as:
>
> Similarly for anything but a pointer chase the extra cycle for a load shouldn't matter [because
> in this case] you've got other dependency chains that can execute in parallel.
>
> I understand now that you were saying:
>
> If you've got enough other work to execute in parallel, then the latency doesn't matter.
>
> Sure, in that case we agree! It leaves a hole though since
> a lot of code won't fit cleanly into either category.
>
> > > > Correct, if I'm not mistaken for uniformly random the chance of page crossings would
> > > > be ~37%. If the displacements or memory positions are even slightly biased towards lower
> > > > numbers you can easily get it below the 20%/~16.6% needed to make it worthwhile, especially
> > > > if you halve the penalty on most cases where it does happen like SKL does.
> > >
> > > Note that the penalty is not "halved" in Skylake except in the pathological case where every load
> > > has the different page behavior. In the "usual" case where only an occasional load has this behavior,
> > > the new behavior on Skylake slightly increases the penalty: you get a 6 cycle penalty (10 vs 4) for
> > > the naughty load, and an additional ~1 cycle penalty on the next load which takes the 5-cycle path
> > > rather than the 4-cycle path even though it probably could have taken the 4-cycle path.
> > >
> > > So for this case where we talk about 20% naughty loads or whatever, we can't
> > > say the penalty is halved on Skylake (it is probably worse than Haswell).
> > >
> > > ISTM the layout would have to be quite biased to get down to the crossover point.
> > >
> >
> > Please don't be condescending just because you don't understand the argument.
>
> Huh? I re-read the above and either you misread me or have an extremely low bar for condescension.
> It certainly wasn't my intent to come across that way, nor did I have the feeling of anything
> other than a frank technical discussion. Please read my comments more charitably!
>
You immediately default to explaining the same thing over and over again. If you had asked for clarification or told me I suck at explaining my argument, that would be fine; I'd just post a more detailed explanation. But this forces me to read through the same thing over and over again, each time longer and "easier to understand" than the last, and it's somewhere between annoying and insulting.
>
> > Low displacements will rarely cross pages so the penalty isn't really a concern.
> > The pathological case and 1024-2047 displacement is what you should be worried about.
>
> Yes, I agree it's mostly not a problem. Most displacements are very small in practice so the penalty usually
> won't occur and is probably in the noise on most code (probably why this has gone unreported for so long).
>
> I was mostly just discussing the choice of 2048 as the cutover point versus 1024. I'm not saying
> that a cutover point of 2048 is worse than not having the 4-cycle fast path at all, I'm saying
> that a back of the envelope calculation would indicate that 1024 is a better cross-over point
> and since I doubt my "back of the envelope" calculation is better than the combined might of Intel's
> simulation, there must be something wrong in the assumptions for my calculation.
>
Correct, which means that reality must be quite a bit different than random displacements.
> Even if you assume lower offsets are much more common, it doesn't make 2048 better: you have to assume also
> that pointers tend to be more often in the first half of the page, and that's what I find fairly unlikely.
>
That's wrong. A displacement of 4096 will always lead to a page crossing, whereas a displacement of 1 can only cause a page crossing for 1 in 4096 pointers.
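To spell out the arithmetic (a minimal sketch, assuming the base pointer's offset within its 4 KiB page is uniformly distributed, which is itself an assumption): the chance that base + disp lands on a different page than base is simply min(disp, 4096)/4096, so low displacements almost never pay the penalty no matter how the displacements themselves are distributed.

def p_cross(disp, page=4096):
    # chance that base + disp lands on a different page than base, assuming
    # the base pointer's offset within the page is uniformly distributed
    return min(disp, page) / page

print(p_cross(1))     # 1/4096, about 0.02%
print(p_cross(2048))  # 0.5
print(p_cross(4096))  # 1.0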
> > Most failures won't happen on the randomly distributed and/or low displacements. When you're
> > near the end of a page or have multiple high displacements in last half or quarter that's
> > where most of them will come from and the blocking does halve the penalty for those.
> > You agreed that in those cases the penalty is halved so if those are most then it's a clear win.
>
> The new Skylake behavior helps in the worst case when you have many back-to-back different page
> addresses, i.e., they are bursty/highly correlated. It doesn't help if the offsets are large and
> the address distributions are more or less random, so the different page case is more or less
> randomly distributed, unless the probability is very high (i.e., approaches worst case).
>
> If you run some numbers you should see that with the "uniform distribution non-bursty" assumption the new
> behavior cuts down on the penalty for high offsets, but not anywhere close to half.
Like I said, I think that most of the penalties will happen in bursty cases, but feel free to show some numbers that say otherwise.
If you want to argue against a strawman "in the exact opposite case that's not true", feel free to do so as well.
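For anyone who wants to run such numbers, here's a rough sketch of the model I have in mind (my assumptions, not a claim about real allocators): 4-cycle fast path, 10 cycles on a replay, 5-cycle normal path, and the SKL-style forced normal path on the load right after a replay. The burst generator just clusters the page-crossing loads while keeping the same overall crossing rate.

import random

FAST, REPLAY, NORMAL = 4, 10, 5   # cycle counts used in this thread

def chain_latency(crossings, skl):
    # crossings[i] is True if load i would end up on a different page than predicted
    total, prev_replayed = 0, False
    for crosses in crossings:
        if skl and prev_replayed:
            total += NORMAL            # SKL: forced 5-cycle path right after a replay
            prev_replayed = False
        elif crosses:
            total += REPLAY            # fast path attempted and replayed
            prev_replayed = True
        else:
            total += FAST
            prev_replayed = False
    return total / len(crossings)

def iid(p, n=200000):
    # page crossings scattered independently with probability p
    return [random.random() < p for _ in range(n)]

def bursty(p, n=200000, run=8):
    # same expected crossing rate p, but crossings arrive in runs of `run` loads
    out = []
    while len(out) < n:
        out += [random.random() < p] * run
    return out[:n]

for p in (0.05, 0.2, 0.5):
    u, b = iid(p), bursty(p)
    print(p,
          chain_latency(u, skl=False), chain_latency(u, skl=True),
          chain_latency(b, skl=False), chain_latency(b, skl=True))

Under these assumptions it should come out the way both of us described it: the forced normal path is a slight loss when the crossings are scattered and only cuts the penalty roughly in half once they arrive back-to-back, which is exactly why how bursty real code is matters.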
> If the pointer chasing
> tends to be highly correlated with respect to within-page position I could see it helping a lot more. Allocators,
> especially when allocating new memory, tend to hand out addresses more or less linearly, so if you just allocated
> a big linked list and don't re-arrange the nodes you would definitely see correlated behavior.
>
> >
> > Of course you can assume instead that someone at Intel decided to put in extra circuitry
> > just for fun even though it's a loss, but that's not the assumption I'm working with.
>
> Where are you getting that impression? Which circuitry are you talking about? The 4-cycle fast
> path, or the new Skylake behavior of doing a 5-cycle load if the last load was replayed? The
> 4-cycle fast path is an obvious big win to me. The 5-cycle Skylake thing is also an obvious
> big win in worst-case and highly correlated cases so I think it also makes sense.
>
The forced normal path after a replay.
I'm not sure what your problem is. You're simultaneously arguing against highly correlated cases being common, which would mean that the SKL behaviour is a loss, while also arguing that it makes sense because it's "an obvious big win in worst-case and highly correlated cases".
Make up your mind. It can't be a big win and at the same time not be a big win.
> > > About 2048 vs other values, one would expect they would still try
> > > all the easy-to-check power-of-two values, e.g., via simulation.
> >
> > For 512-1023 it's ~19% even if it were uniform and for 2048-4095
> > it's ~75% so it was always going to be either 1024 or 2048.
>
> Well sure, if the various biasing were stronger or weaker, or the Skylake-like mitigations
> changed the calculations, you could easily get a smaller number or 4096.
>
> If you just assume totally uniform, then 512 seems to maybe come out on top in a Haswell-like chip.
No, the optimum is ~820; 1024 is slightly better than 512.
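For completeness, this is the kind of back-of-the-envelope sweep I mean (my assumptions: displacements uniform over [0, 4096), page offsets uniform so a displacement d crosses with probability d/4096, 4-cycle fast path, 5-cycle normal path, and a replay costing some number of extra cycles on top of the normal path; treating the SKL blocking as "one cycle less of effective replay penalty" is only a crude stand-in):

PAGE = 4096
FAST, NORMAL = 4, 5

def expected_latency(cutoff, replay_extra):
    # displacements uniform over [0, PAGE); d < cutoff takes the fast path,
    # crossing with probability d / PAGE and replaying at NORMAL + replay_extra
    total = 0.0
    for d in range(PAGE):
        if d < cutoff:
            p = d / PAGE
            total += FAST * (1 - p) + (NORMAL + replay_extra) * p
        else:
            total += NORMAL
    return total / PAGE

for replay_extra in (5, 4):
    # fast path pays off while FAST + (1 + replay_extra) * d/PAGE < NORMAL
    breakeven = PAGE * (NORMAL - FAST) / (NORMAL - FAST + replay_extra)
    print(replay_extra, breakeven,
          [round(expected_latency(c, replay_extra), 3) for c in (512, 1024, 2048)])

Under these assumptions the break-even displacement is ~683 with a 5-cycle replay penalty (where 512 edges out 1024) and ~819 with a 4-cycle one (where 1024 edges out 512; that's where my ~820 comes from). Either way, 2048 only comes out ahead once you assume the displacement or offset distributions are biased.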