By: Wilco (Wilco.Dijkstra.delete@this.ntlworld.com), February 17, 2013 1:47 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on February 15, 2013 6:34 pm wrote:
> Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on February 15, 2013 5:46 pm wrote:
> >
> > That's a good tradeoff, but you need 3 extra bits on compares of course
>
> It's worse than that. You'll want to do subtract-and-test, bitops, yadda yadda. Compare really isn't even close
> to complete. Otherwise you'll just end up playing games with other arithmetic-logical ops to then be able to
> compare the result. Which is *not* an improvement, no matter how some people would want to twist it,
One condition bit for arithmetic and logical operations is enough: more than 90% of the time you only need a zero/non-zero result. You only need special opcodes for add with carry, but then again you need those anyway.
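To make the add-with-carry point concrete, here is a minimal C sketch of what a two-word add looks like when there is no carry flag and no add-with-carry opcode: the carry out has to be recovered with an extra unsigned compare. The names `u32_pair` and `add_wide` are made up for illustration, and the exact instruction sequence a compiler produces will depend on the target.

```c
#include <stdint.h>

/* Hypothetical two-word integer: lo holds the low 32 bits, hi the high 32. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u32_pair;

/* Add two double-word values without relying on a carry flag. */
u32_pair add_wide(u32_pair a, u32_pair b)
{
    u32_pair r;
    r.lo = a.lo + b.lo;               /* unsigned add wraps modulo 2^32 */
    uint32_t carry = (r.lo < a.lo);   /* wrapped around => carry out is 1 */
    r.hi = a.hi + b.hi + carry;       /* an add-with-carry opcode would fold this in */
    return r;
}
```

With an add-with-carry instruction the compare and the extra add disappear, which is why that opcode is worth having regardless of how many condition bits the ISA exposes.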
> The whole "decrement loop counter and test" is common enough to merit special logic
> (sometimes to the point of having special loop registers, like powerpc).
A decrement-register-and-branch-if-non-zero instruction is useful indeed. However, I don't believe special counter registers like POWER's are useful at all: they don't seem to provide any gain while requiring lots of unnecessary extra instructions. If you make a call inside a loop you actually have to save/restore that register via GPR moves, IIRC... That's just useless.
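As a rough illustration of the loop shape I mean, here is a count-down loop in C. The `--n != 0` test at the bottom is the pattern a fused decrement-and-branch-if-non-zero instruction would cover in one operation on an ordinary register. The function name `sum_countdown` is invented for this sketch, and whether a given compiler actually emits such an instruction is target-specific.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n elements of a[], counting n down to zero. */
long sum_countdown(const int *a, size_t n)
{
    long sum = 0;

    if (n == 0)
        return 0;           /* nothing to do; also keeps the do/while safe */
    assert(a != NULL);      /* a non-empty range needs a valid pointer */

    do {
        sum += *a++;
    } while (--n != 0);     /* decrement the counter, branch if non-zero */

    return sum;
}
```

Because the counter lives in a general-purpose register, a call inside the loop body needs no special save/restore beyond the normal calling convention.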
> The thing is, the traditional "just have a flags register" really is hard to beat.
> People have tried, and people have inevitably failed. Not a single architecture
> has ever done better than that. I don't understand why people try so hard.
I wouldn't call MIPS or Alpha bad or failed - both have very good branch instructions. In the old days MIPS gained a lot vs ARM from its single cycle compare&branch instructions (delayed branches also played a role of course but don't gain you anything once you have branch prediction). So I don't get why you think condition codes are so much better.
> And no, reducing the flag register to a single bit isn't a win either.
>
> Seriously, why would it be? It just shifts the decoding bits around by a tiny amount, and not in a good
> way. The only reason to think it's an advantage is if you have an overly restrictive decoder, so that
> for some particular instruction encoding the "shift bits around" can be seen as a good thing.
Yes, reducing condition bits can be a good tradeoff for encodings. Having multiple bits is more general of course, however most ISAs manage to turn that into a disadvantage by having many instructions only update a subset of the flags...
> But while it can be a win for some particular encoding model, it's a *bad* thing in general.
> It just makes it much harder to do all the things people use flags for. Yes, that includes
> carry, but it also includes "test against zero and negative at the same time" etc etc.
>
> People really do that - small case-statements turn into "test against an exact match of one
> value, and larger-than and smaller-than AT THE SAME TIME". And it's totally natural and correct
> to do with a flags register, and anybody who disputes that is just in denial.
The switch test case is an example of the higher generality of multiple condition bits, but that's the only case where you'd ever see 2 branches after one compare. Most condition tricks are only useful when writing highly optimized assembly code. In practice for compiled code you'd hardly notice a difference between 1 and 4 condition bits.
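For reference, the switch case being discussed looks roughly like this when written out by hand. This is only a sketch: the function name `dispatch` and the case values 10/20/30 are made up, and the comments describe what a flags-based ISA could do rather than what any particular compiler emits.

```c
/* Small switch lowered to a binary decision tree.
   Returns 1, 2 or 3 for the case values 10, 20, 30 and 0 for no match. */
int dispatch(int x)
{
    /* Pivot on the middle case value. */
    if (x == 20)                  /* flags ISA: one compare, then branch-if-equal */
        return 2;
    if (x < 20)                   /* flags ISA: branch-if-less reuses that compare */
        return x == 10 ? 1 : 0;   /* lower half of the tree */
    return x == 30 ? 3 : 0;       /* upper half of the tree */
}
```

With a single condition bit or compare&branch, the second test at the pivot costs one more compare per tree level, and that is the whole difference.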
> So "one bit of flags" is just a sign of working around particular bitpattern encoding issues rather than "it's
> actually a good thing". It's a *bad* thing. You want at least three bits for the conditionals probably four
> or five bits of flags total (depending a bit on whether you care to encode overflow separately etc).
Again, in compiled code you wouldn't notice the difference.
> There's a reason a lot of architectures used flags and made it possible to set them in most arithmetic
> ops.
Condition codes are a historical artifact. Ever noticed how every ISA defines its condition bits slightly differently? It's what the very first ALU implementation happened to do. That's why some instructions only update some flags based on the value of a register...
> It's flexible and powerful, and you really do want them for most arithmetic ops. And doing it unconditionally
> for pretty much everything (like x86 does) is an actual encoding space advantage, even if some people will
> have a hard time admitting that, and will want to have the extra "enable" bit for it.
Having every instruction set condition flags is a very bad idea, particularly if different instructions set different subsets of flags. I talked to the designers of Cortex-A8 at the time and had a hard time convincing them to add all the extra hardware to execute 2 flag-setting ALU instructions per cycle and merge the flag results without any latency. The complexity of the ALUs was a reason why the A8 was late. But if they hadn't done it, Thumb-2 code would have been ~10% slower, and that would have destroyed the goal of Thumb-2 being as fast as ARM but with Thumb-1 code size.
So if you do want condition codes then you really do need to add a flag-setting bit on every arithmetic instruction. That way an implementation can decide to execute only one flag-setting instruction per cycle without a penalty. And you get the obvious scheduling benefits of course.
Overall for modern OoO CPUs I believe combined compare&branch is best.
Wilco