Following are the results on each platform. I’ve withheld the names of the indviduals, and simply used initials instead, except for the ‘winner’ who has given permission to do so. I’ll allow the rest identify themselves if they wish to on the forum:
The results of these tests proved to be much more interesting than I originally thought they would. There were several items that seemed to stand out to me.
- Most of the entries perform much better on the Athlon (relative to the original code) than the PIII and P4. One in particular runs very poorly on the PIII. This seems to indicate that some of the code optimizations used are very platform specific.
- The assembler version performs very well on the Athlon and P4 systems, but not quite as well (relative to the C version by the same contributor) on the PIII system. Again, this appears to be due to platform specific issues with the instruction sequence
- While the Intel V6 compiler (used by entrant RU) seems to produce much faster code on all platforms than the Microsoft compiler, it seems to have the greatest benefit on the PIII (almost a 50% performance difference). The performance difference on both P4 and K7 is around 30% vs. VC7. Also, the SP5 update of the VC6 compiler seems to provide a significant performance benefit over the original release, and is about on par with VC7 (my version, at least) for producing fast executables.
- Delphi doesn’t look very good at all here on any platform, being about 50% slower than the C version. At least it doesn’t appear to favor (or disfavor) any particular platform.
Based upon items #1 and #2, the idea that ‘non optimized’ code always runs better on PIII (vs. P4) seems to be challenged a bit here. Likely, however, the small size of the program allows it to take advantage of the very fast (though small) P4 trace cache.
There is something else that seems very interesting if we pull out only the results from the VC7 compiled modules. After sorting the results based on the Athlon’s performance, you can see, all of the results from contributors had increased performance on the Athlon, but not on the other platforms. In fact, it appears that the optimizations used by most of the contributors actually made performance worse for the PIII and P4.
Most contributors focused on trying to ‘optimize’ the instructions used in the main loop. Some used assembler to have greater control, while others tried to ‘outwit’ the compiler to force a more optimal instruction sequence. Only one really looked at the algorithm itself and tried to optimize it – which also happens to be the one that is the fastest on all platforms. Faster even than the pure assembler version.
Let’s take a quick look at what each contributor did to try and optimize the code. You can see the actual source, and the author’s comments in the Appendices of this article.
- TE (see source) decided to replace the C modulus and divide instructions with some inline assembler, in an attempt to generate fewer ‘slow’ instructions in the critical loop. He also changed the signed integer variables to unsigned, so the compiler would generate a few faster instructions. This gained him a bit on the Athlon, but actually worked against him on the PIII and P4.
- NB (see source) thought to eliminate the divide entirely by using 64-bit multiplication. Apparently this works fine with gcc, but VC7 doesn’t seem to generate the desired code. His own tests with gcc show up to a 35% improvement over the original. With VC7, only the Athlon platform seems to not mind his changes – but the PIII really doesn’t like them. Overall, probably the worst solution when using this compiler.
- CD (see source) also used unsigned integers, and tried to optimize the critical loop by using a ‘while’ instead of a ‘for’ and eliminating one iteration of the loop. This didn’t provide much benefit on the Athlon, but also didn’t penalize him on the other two platforms. In fact, this was the second best solution for the P4. Overall, a better cross-platform solution than the previous two.
- RU (see source) appears to have spent a great deal of effort to replace slower instructions with faster ones throughout the code, as well as ‘unrolling’ some of the loops. As you can see, this seems to have done wonders on the K7, but not for the PIII or P4. Primarily because of the large gain on the Athlon, this is a solid second place entry.
- PH (see source) – also known by many in the tech community as Paul Hsieh, beat all other submissions very handily. It turns out that while most of the others focused on trying to eke out the best performance by second guessing the compiler, Paul stepped back and looked at the actual algorithm. He recognized that the algorithm was designed to allow multiple digits to be generated simply by increasing the base number used (in this case from 10 to 10000). In other words, by making a relatively few changes (mostly to fix bugs when using the larger base) he improved performance by generating 4 digits at a time instead of one. Proof that there is no better optimizer than the human brain (currently, at least). And, it performs equally as well on all platforms. Interestingly, he performed none of the instruction optimizations that the others attempted (i.e., use of unsigned integers, etc.)
In conclusion, it seems that at least in this case, the programmer was significantly more influential in the raw performance of the resulting executable than the compiler. RU did extract some decent performance gains using ‘brute force’, which may be appropriate in many cases, but Paul showed us that before focusing on the instruction mix we should first optimize the algorithm to get maximum results. Most of the optimizations were not very good for cross-platform performance, but were seemingly somewhat platform specific. The exception to this, of course, was the optimization by Paul Hsieh, which provided an equal boost on all platforms. Again, this shows that the programmer can be more influential in the cross-platform performance than the compiler itself.
What this seems to illustrate is that simply ‘counting instructions’ is not necessarily the best optimization technique, and may end up making performance worse if you are not careful. This may be because the specific instruction(s) perform slightly differently on various platforms, or a different compiler may not be as ‘smart’ as the one you are used to. The example provided by NB is evidence of this, where a technique that apparently works well for PPC processors doesn’t work so well for x86, and PIII in particular (though the Athlon handles it reasonably well). In my own work, I have seen similar situations where very large performance gains were made by stepping back and rethinking the way the algorithm is implemented, after a number of other ‘instruction tweaking’ changes had been made with minimal effect.
I found it interesting that while virtually all contestants recognized that the loop was the critical bit of code, and that most of the execution time came down to only a few instructions, two slightly different approaches were taken to deal with that bottleneck. One group, as mentioned, replaced the offending instructions with faster ones (RU, TE and NB), while the other tried to reduce the number of times those instructions were executed (Paul Hsieh and CD). It could be said that RU won the ‘instruction replacement’ approach, while Paul won the ‘instruction reduction’ approach. It would be fun to see what effect a ‘fusion’ of both RU’s and Paul Hsieh’s approaches would have on performance, and we may be able to show this in the near future.
It may be that the optimizations by the others (particularly those made by RU) would provide a greater benefit after the algorithm is optimized, and in fact, the Michael Abrash quote at the beginning of the article is extracted from a chapter in his book that provides an example of this. In this book he graphically shows that tightening up the code can provide the best benefit after one has optimized the algorithm. While it is true that our example may have been somewhat unusual and rarely encountered in ‘real life’, this is also a good argument that it is important to recognize these opportunities, because this is what can give you the edge over the competition in the marketplace.
Discuss (16 comments)