Is It Good, Bad, or Ugly?
For the third time in less than a week, the Pentium 4 has been reviewed on Tom’s Hardware Guide. The first evaluation concluded that the P4 is a poor choice for current applications, but shows some promise in bandwidth-intensive applications. The second review slammed the P4 harder, claiming that it doesn’t even do well in MPEG-4 encoding using a ‘freeware’ utility called FlasK, and so is only good for playing Quake3. This latest review reverses that verdict after the code was recompiled with Intel’s current compiler and modified to include several different types of code optimizations.
In this latest article, the comment is made that evaluating the P4 ‘…is more difficult than with any other product that I have evaluated before’, which makes good sense since it is not just a new processor, but a new micro-architecture as well! Evaluating a processor that is simply faster, or has only minor process improvements, is one thing, but comparing two different architectures is something else again. To their credit, THG has recognized some of the shortfalls of the first two evaluations, and appears to have embarked upon a process of revisiting other benchmarks as well. Now AMD Zone is recompiling and optimizing the FlasK code for the Athlon, claiming that this beats the scores THG was able to achieve for the P4 with its optimized code. This is genuinely interesting, and indicates just how much difference a proper compiler can make to performance, but we should also ask why this is only now being considered by publications many people have looked to as experts in computer performance evaluation.
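To make that point concrete, consider the kind of inner loop that dominates an MPEG-4 encoder such as FlasK. The sketch below is purely illustrative (it is not FlasK’s actual code, and the compiler flags mentioned in the comments are examples rather than the ones THG or AMD Zone used), but it shows the sort of small, hot kernel whose performance can swing dramatically depending on how well a given compiler schedules, unrolls, or vectorizes it for a given target.

```c
/*
 * sad8x8.c - an illustrative sum-of-absolute-differences kernel, the kind
 * of hot loop that dominates motion estimation in MPEG-4 encoders such as
 * FlasK.  This is NOT FlasK's actual code; it simply shows the sort of
 * routine whose speed depends heavily on how well the compiler schedules,
 * unrolls, or vectorizes it for a particular processor.
 */
#include <stdio.h>
#include <stdlib.h>

/* Sum of absolute differences over an 8x8 block; 'stride' is the row pitch. */
static unsigned sad_8x8(const unsigned char *cur, const unsigned char *ref,
                        int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            sad += abs(cur[x] - ref[x]);
        cur += stride;
        ref += stride;
    }
    return sad;
}

int main(void)
{
    unsigned char cur[8 * 16], ref[8 * 16];

    /* Fill two blocks with arbitrary data so the call cannot be elided. */
    for (int i = 0; i < 8 * 16; i++) {
        cur[i] = (unsigned char)(i * 7);
        ref[i] = (unsigned char)(i * 3);
    }
    printf("SAD = %u\n", sad_8x8(cur, ref, 16));
    return 0;
}

/*
 * Example builds (flags are illustrative, not those used in the reviews):
 *   gcc -O2 sad8x8.c -o sad_generic       -- generic x86 code
 *   gcc -O2 -march=pentium4 sad8x8.c ...  -- scheduled for the P4, on
 *                                            compiler versions that support it
 * A vendor compiler such as Intel's may vectorize the loop with SSE2
 * outright; the same source, built three different ways, can produce very
 * different scores on the same chip.
 */
```

The same effect, in the opposite direction, is what AMD Zone is exploiting by tuning the build for the Athlon.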
What is Being Tested, and Why?
It seems to me that there are at least two possible intentions when a product is being evaluated. One is to evaluate the product itself in terms of current usability, feature set, performance and so on, while the other is to evaluate the technologies implemented in the product and compare them to previous technologies and implementations in order to determine their potential.
The first type of evaluation is somewhat straightforward, though time-consuming (if properly done). Assuming the proper diagnostic tools and benchmarks are used and understood, it is not difficult to conclude that the product is either good or bad when compared against competing products. The main issue here is that the tools, such as benchmarks, must be understood. If one does not know what a particular benchmark is actually doing, it is obviously not possible to come to any conclusions about the results!
If I may rant for a moment, I find it disturbing that in many cases publications spend only a few hours evaluating products for this purpose – just enough time to run a half-dozen benchmarks, count the capacitors/resistors/whatever, and read the manual. The resulting reviews include such categories as ‘stability’, ‘quality’, ‘compatibility’ and other equally impressive sounding terms that would be virtually impossible to determine without spending weeks testing under various loads, operating systems and system designs. A few months ago, RWT decided to no longer perform product evaluations because it is difficult to compete with review jockeys and maintain a high-quality evaluation.
The second type of evaluation is much more difficult, because it requires that the technologies being evaluated be well understood. With this type of review, it is sometimes necessary to modify existing tools, or invent new ones, to show what the technology is capable of, since existing tools will likely not exercise previously non-existent features. This can be difficult and time-consuming even for experienced engineers, which many reviewers are not. Conclusions need to be appropriate for the type of tests performed, whether extensive or limited in scope, which means that the reviewer needs to understand the limitations of the tools being used.
It seems apparent to me that the Pentium 4 reviews, and the resultant controversies, highlight a problem whereby product evaluations and technology evaluations get confused by the general public, and even by professional reviewers. In many cases, benchmarks were run simply for the sake of running benchmarks, and conclusions were drawn from the results without any supporting arguments, so the reader has to either accept or reject them with no firm foundation. As stated earlier, if you don’t know what your tools are measuring, how can you come to any reasonable conclusions?
For an example of a good evaluation, take a look at the Ace’s Hardware Pentium 4 review, and notice that each benchmark is accompanied by a description of what that benchmark is actually testing. This way, the reader can see how the conclusions were reached, and can either accept or reject them based upon the perceived validity of the benchmark evaluation.
Walk, or Ride the Bandwagon
Shortly after the publication of the myriad of Pentium 4 reviews, the AMD faithful began to rally around the Athlon, proclaiming it the once and future King. Since Paul DeMone had written technical analyses indicating that the P4’s performance might be better than expected, and that recompiling/optimizing code could very well result in large performance gains, some of the more militant began to cast aspersions at Paul and insisted that he confess his sins against AMD. When he expressed surprise at the poor results under current benchmarks, but maintained his stand that the P4 architecture has much promise, things started to get a little nasty, particularly after Tom’s Hardware Guide published their first update saying the P4 performed worse using ‘current’ MPEG-4 code than previously thought.
The more recent information seems to indicate that Paul’s evaluation was closer to the truth than some were willing to admit, though it certainly doesn’t mean that vendors will rush out to optimize or recompile their products. The Pentium 4 still appears to be a poor value compared to other options for current applications, but that does not negate the fact that some applications can benefit greatly from being optimized.
Bandwagons are very easy to jump onto. When everyone around you is saying the same thing, it can make you feel that you have a very strong position, which makes it easy to shout out your opinions. What is difficult is to buck the trends, and take a close look at the issue from an objective viewpoint.
Let’s Get Back to Basics
The fact that reviewers are now making the effort to investigate the implications of optimizing code for a particular technology is a very positive step. The question we should be asking is why so many of those looked upon as experts in performance evaluation were unaware of these issues. As should be obvious, using Ziff-Davis’ year-old benchmark executables (or anyone else’s) is simply not acceptable for evaluating new technology, yet these are exactly the kinds of applications some reviewers used to declare the Pentium 4 ‘worthless’. It may not be a good choice for current applications, but this type of evaluation is inappropriate for drawing conclusions about the future potential of the technology.
In the future, it would be nice to see the large (and small) review sites working more closely with the developers of the product, as well as their competition, to make sure that all possible bases have been covered. It would also be an excellent idea to consult with independent computer architects who can provide hints and suggestions about how one might be able to evaluate the technologies, perhaps even supplying some sample code. Either the benchmark organizations need to release their products in tandem with the new products (and not just one company’s) or they need to provide source code so that the reviewers can compile them, just as SPEC does.
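As a rough sketch of what the ‘ship the source, let the reviewer build it’ model looks like in practice (the kernel and timing loop below are hypothetical and deliberately trivial, not part of any SPEC suite), the same C file can simply be rebuilt with whichever compiler and flags suit the platform under test:

```c
/*
 * bench_skel.c - a hypothetical skeleton of a source-distributed benchmark.
 * The reviewer compiles this file with the compiler and flags appropriate
 * to the platform under test; the workload is fixed, the binary is not,
 * which is essentially the SPEC approach.
 */
#include <stdio.h>
#include <time.h>

#define N 10000000

/* A fixed, portable workload; real suites use far larger, realistic kernels. */
static double workload(void)
{
    double sum = 0.0;
    for (int i = 1; i <= N; i++)
        sum += 1.0 / (double)i;    /* harmonic series, just to burn FP cycles */
    return sum;
}

int main(void)
{
    clock_t start = clock();
    double result = workload();
    clock_t end = clock();

    printf("result = %f, cpu time = %.3f s\n",
           result, (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}

/*
 * Hypothetical builds on two different systems:
 *   gcc -O2 bench_skel.c -o bench       -- baseline build
 *   icc -O3 bench_skel.c -o bench       -- vendor-tuned build (flags vary)
 * A published result would state exactly which compiler and flags were used,
 * so readers can judge what is actually being measured.
 */
```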
It is unlikely that this specific situation will cause the industry to change in any appreciable manner, but hopefully the general public will recognize the many shortcomings of the current process many publications use, and will view the analyses with a much more critical eye. THG benefited from having some very knowledgeable people provide feedback and assistance in bringing some of these issues to light. Let’s hope that all of this is not lost on the ‘hardware review’ industry and the public at large.