By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), February 17, 2013 10:54 am
Room: Moderated Discussions
⚛ (0xe2.0x9a.0x9b.delete@this.gmail.com) on February 17, 2013 10:49 am wrote:
> >
> > In contrast, multiple CC registers are just stupid.
>
> Isn't it true that the following code could run faster with 2 CC registers:
>
>
No. For multiple reasons.
(a) The above isn't nearly as common as the single test one, so even if it was faster, it's not worth worrying about. But it's not faster even for the combined case:
(b) multiple cc registers need extra bits in the instruction to specify which cc. End result: either you have huge instructions and your I$ sucks (Itanium), or more commonly you have one primary cc, and need special instructions for the other ones. In either case, it adds complexity.
(c) The above isn't nearly as common as the single test one, so even if it was faster, it's not worth worrying about. But it's not even faster.
(d) you can do it with a single CC and trivial optimizations. For example, in this case, you can do it with two subtracts and one 'or', and then checking the result of the 'or' for being zero. So your example could be trivially done with three instructions, and the 'or' just sets the CC bits implicitly.
(e) even though you can do it with pure arithmetic and just a single CC setting instruction (which is not uncommon), it's not even usually *worth* it. The obvious "just do two compares and two jumps" is likely the optimal code sequence, much better than trying to combine conditionals. Why? Branches usually predict well, and branch prediction makes conditional branches free (and even "better than free", because you can speculate and re-order across them, unlike data dependencies.
(f) multiple CC's means you also have to spend more effort on renaming them. x86 gets some of this with it's "single" register (it's really three sets of bits: the "status flags" end up being split into two by the fact that "Carry" is special with some instructions setting it but not others, and then you have the special system bits). But powerpc has the same exact problem, with its eight sets of 4-bit groups. So x86 has extra complexity for its own historical accidents, but the powerpc people added their stupidities on purpose!
Btw, that (e) thing is a big issue. Playing games with combined conditionals is usually just a huge waste of time - even if you were to save an instruction (which you generally don't), you do so at the cost of making the critical data-dependency chain longer and the code this slower.
Put another way: one point of combining conditionals is to remove conditional branches by replacing them with data flow, but conditional branches are just about the cheapest instruction you can have when they predict well. So not only do you need multiple conditionals to begin with, they all have to be unlikely! And please do note the "all" part. That's what makes the use of them such a spectacular failure. There simply aren't enough places for them to matter, and the places where they *do* matter can be written (admittedly often with slightly more complexity) with just a single CC and conditional moves/sets anyway.
End result: multiple CC's are just a f*cking stupid idea. Yes, you can find cases where they help, but they are much rarer than you'd think. The fact that branch prediction works really well is one primary downfall of them, but the thing is, even if you ignore that one, most of the time there is little disadvantage to a single CC, and arithmetic either on the sources (like in your very own example, as I showed you!) or on conditional moves will not be noticeably worse even with a single CC.
The only really valid reason for multiple CC seems to be "I'm a failed architecture, and I will do things in-order, and multiple CC's allow me to do lots of explicit scheduling in software by the compiler".
And quite frankly, anybody who believes that is a good reason for them should join the Itanium team and try to take over the world. At least the PowerPC people had *some* excuse for believing that 25 years ago. They were wrong too, but they didn't have tons of history to show them wrong.
Seriously. Multiple CC's is completely idiotic. Thinking that in-order machines and relying on the compiler to schedule instructions makes sense for performance is f*cking moronic.
Linus
> >
> > In contrast, multiple CC registers are just stupid.
>
> Isn't it true that the following code could run faster with 2 CC registers:
>
>
if a==b && c==d { ... }
No. For multiple reasons.
(a) The above isn't nearly as common as the single test one, so even if it was faster, it's not worth worrying about. But it's not faster even for the combined case:
(b) multiple cc registers need extra bits in the instruction to specify which cc. End result: either you have huge instructions and your I$ sucks (Itanium), or more commonly you have one primary cc, and need special instructions for the other ones. In either case, it adds complexity.
(c) The above isn't nearly as common as the single test one, so even if it was faster, it's not worth worrying about. But it's not even faster.
(d) you can do it with a single CC and trivial optimizations. For example, in this case, you can do it with two subtracts and one 'or', and then checking the result of the 'or' for being zero. So your example could be trivially done with three instructions, and the 'or' just sets the CC bits implicitly.
(e) even though you can do it with pure arithmetic and just a single CC setting instruction (which is not uncommon), it's not even usually *worth* it. The obvious "just do two compares and two jumps" is likely the optimal code sequence, much better than trying to combine conditionals. Why? Branches usually predict well, and branch prediction makes conditional branches free (and even "better than free", because you can speculate and re-order across them, unlike data dependencies.
(f) multiple CC's means you also have to spend more effort on renaming them. x86 gets some of this with it's "single" register (it's really three sets of bits: the "status flags" end up being split into two by the fact that "Carry" is special with some instructions setting it but not others, and then you have the special system bits). But powerpc has the same exact problem, with its eight sets of 4-bit groups. So x86 has extra complexity for its own historical accidents, but the powerpc people added their stupidities on purpose!
Btw, that (e) thing is a big issue. Playing games with combined conditionals is usually just a huge waste of time - even if you were to save an instruction (which you generally don't), you do so at the cost of making the critical data-dependency chain longer and the code this slower.
Put another way: one point of combining conditionals is to remove conditional branches by replacing them with data flow, but conditional branches are just about the cheapest instruction you can have when they predict well. So not only do you need multiple conditionals to begin with, they all have to be unlikely! And please do note the "all" part. That's what makes the use of them such a spectacular failure. There simply aren't enough places for them to matter, and the places where they *do* matter can be written (admittedly often with slightly more complexity) with just a single CC and conditional moves/sets anyway.
End result: multiple CC's are just a f*cking stupid idea. Yes, you can find cases where they help, but they are much rarer than you'd think. The fact that branch prediction works really well is one primary downfall of them, but the thing is, even if you ignore that one, most of the time there is little disadvantage to a single CC, and arithmetic either on the sources (like in your very own example, as I showed you!) or on conditional moves will not be noticeably worse even with a single CC.
The only really valid reason for multiple CC seems to be "I'm a failed architecture, and I will do things in-order, and multiple CC's allow me to do lots of explicit scheduling in software by the compiler".
And quite frankly, anybody who believes that is a good reason for them should join the Itanium team and try to take over the world. At least the PowerPC people had *some* excuse for believing that 25 years ago. They were wrong too, but they didn't have tons of history to show them wrong.
Seriously. Multiple CC's is completely idiotic. Thinking that in-order machines and relying on the compiler to schedule instructions makes sense for performance is f*cking moronic.
Linus