By: Patrick Chase (patrickjchase.delete@this.gmai.com), July 7, 2015 10:49 am
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on July 7, 2015 9:23 am wrote:
> Patrick Chase (patrickjchase.delete@this.gmai.com) on July 6, 2015 10:31 pm wrote:
> > And yet you yourself have (very effectively, with real data) made the argument that
> > cmov seldom pays on x86.
>
> Absolutely. I think cmov is often a bad idea, because it leaves those data dependencies.
> And because it's often a bad idea, it's probably under-utilized in some cases (and also
> probably over-utilized in other cases).
Back when I was mentoring people who did a lot of DSP-ish coding, I saw a common pattern: there would inevitably come a time when cmov/select was the right solution to a performance problem, so I would show them the appropriate idioms to convince the compiler to emit it (or the intrinsic, or an asm directive). Most of them would then go batsh*t crazy and use selects in all sorts of inappropriate places. Modern branch predictors are pretty good, and in my experience over-utilization ends up being the bigger problem.
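For concreteness, a minimal sketch of those idioms (my own example, not anything from the original exchange). With optimization on, gcc and clang will usually compile the ternary below to a cmov on x86-64, but neither guarantees it; the inline-asm helper is a hypothetical way to force the issue when the compiler won't cooperate:

#include <stdint.h>

/* Data-dependent select: compilers commonly emit a cmov here. */
int32_t clamp_lo(int32_t x, int32_t lo)
{
    return (x < lo) ? lo : x;
}

/* Hypothetical helper forcing a cmov via GNU inline asm (x86-64). */
static inline int32_t select_cmov(int32_t cond, int32_t a, int32_t b)
{
    __asm__("testl %1, %1\n\t"   /* sets ZF iff cond == 0 */
            "cmovne %2, %0"      /* b = a when cond != 0  */
            : "+r"(b)
            : "r"(cond), "r"(a)
            : "cc");
    return b;                    /* cond ? a : b */
}

Of course nothing stops anyone from pointing select_cmov() at a perfectly predictable branch, which is exactly the failure mode above.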
> And that's part of my point - I think it would be interesting if hardware turned it
> into a predicted move, exactly to remove the data dependencies when there is a strong
> reason to believe it's the right thing to do. That's exactly the kind of information the
> CPU branch predictor already has (well, most of them do - not just predicting which way
> a branch goes, but also how likely it is).
>
> So hardware has the potential to offer the best of both worlds: keep the data
> dependency when it makes dynamic sense, and break it when it is likely the right thing
> to do.
>
> That's the kind of choice you can make at a hardware level. Doing it at the
> software level is really really problematic, for all the reasons outlined.
>
> See my argument?
Ah, understood. A couple thoughts:
1. In all but the most trivial cases the HW would also have to issue and execute all of the instructions on the not-predicted side of the branch. You might end up needing a hyperblock cache to mitigate the impact on I-fetch (i.e. an L0 I$ that stores pre-predicated hyperblocks, along the lines of, but even less efficient than, P4's superblock cache [*]).
2. Do existing branch predictors really have that information? I thought that even modern ones still tended to keep a fairly small amount of per-context state, where "context" is a catch-all for the branch address, histories, etc. (a toy sketch follows the footnote below).
[*] It was never a trace cache, Intel's terminology notwithstanding. P4's L1 I$ contained single-entry, multi-exit EBBs, so the correct term is "superblock". Traces are multi-entry, multi-exit.
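On point 2, a toy sketch (mine, with made-up table sizes) of the per-entry state a classic bimodal predictor keeps: one 2-bit saturating counter, where the only "confidence" signal is whether the counter is saturated. Real designs (gshare, TAGE, perceptrons) index differently and some add dedicated confidence counters, but the per-branch budget is still a handful of bits:

#include <stdbool.h>
#include <stdint.h>

#define TABLE_BITS 12
static uint8_t ctr[1u << TABLE_BITS];    /* 2-bit counters, values 0..3 */

static bool predict_taken(uint32_t pc)
{
    return ctr[pc & ((1u << TABLE_BITS) - 1)] >= 2;
}

static bool high_confidence(uint32_t pc) /* saturated either way */
{
    uint8_t c = ctr[pc & ((1u << TABLE_BITS) - 1)];
    return c == 0 || c == 3;
}

static void train(uint32_t pc, bool taken)
{
    uint8_t *c = &ctr[pc & ((1u << TABLE_BITS) - 1)];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}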
> That said, I also have to say that
>
> (a) I think cmov on x86 has improved. It used to have pretty bad latencies, afaik
> they've improved. So you still do have the data dependencies, but for many cases it
> probably doesn't matter that much.
Better, but not free: cmov is still 2 uops on Haswell/Broadwell.
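To make the dependency argument concrete, a hedged illustration (my example, not Linus's). In the select form every iteration flows through best at cmov latency; in the branchy form a correctly predicted not-taken branch never writes best, so speculation breaks the chain. Compilers may if-convert or vectorize either loop, so check the generated asm before trusting the labels:

#include <stddef.h>
#include <stdint.h>

/* Both assume n >= 1. */
int32_t max_select(const int32_t *a, size_t n)
{
    int32_t best = a[0];
    for (size_t i = 1; i < n; i++)
        best = (a[i] > best) ? a[i] : best; /* usually cmov: serial chain through best */
    return best;
}

int32_t max_branch(const int32_t *a, size_t n)
{
    int32_t best = a[0];
    for (size_t i = 1; i < n; i++)
        if (a[i] > best)
            best = a[i];                    /* mostly not taken: iterations overlap */
    return best;
}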
> (b) there are clearly pretty big gotchas with using predictors too, and it may well be
> the case that it's not worth it. I wouldn't be surprised if this has been simulated, and
> real hw architects have come to the conclusion that the mispredicts just kill you.
>
> (c) since people and compilers have been taught to try to avoid cmov for well-predicted
> stuff, and some of those judgments are probably quite correct, the existing use may well
> be skewed enough towards "unpredictable" that the upsides are even smaller.
>
> So I'm certainly not claiming it's a no-brainer. I just think it would be interesting,
> and potentially something that hardware could do better (and it would allow software
> to maybe do better too, by making cmov more generically useful).
Yeah, agreed. The thing that I wrestle with here is that the easy-to-implement approach doesn't cover all that many real-world use cases IMO, while the general solution appears likely to be so expensive that it might be better to spend those gates on better predictors.