The Benchmark Examiner

This is the first of what I intend to be a series of articles on benchmarking, benchmarking issues and benchmarks themselves. I will attempt to perform evaluations, talk with experts and manufacturers, and provide whatever information I can about the benchmarking industry and the various tools and techniques used, not only in computer publications, but within the industry itself. I don’t expect to uncover any subversive plots or present any shocking revelations, but I do expect to learn and present useful information and methods for testing and interpreting benchmark results from various products. I also hope that reviewers and industry professionals will take interest and provide their own views and feedback that I can incorporate into future articles.

One of the more interesting trends of the past few years has been the evolution of benchmarking on hardware enthusiast sites. Industry consortiums such as SPEC, TPC and BAPCo have offered benchmarks for both high-end and low-end systems for a number of years, but until recently they were used primarily by manufacturers for marketing their products. Due to the cost and complexity of these tools, most enthusiasts and publications opted for either free tools or in-house developed benchmarks. Ziff-Davis publications founded ZDBOp (now called eTesting Labs) in the early ’90s to provide such an alternative for hardware enthusiasts: free, and therefore easily accessible to the general public. Various benchmarks have been developed and offered by the large computer publications and a few private companies, such as Wintune, but few of them have gained acceptance to any degree.

In 1996, when I first started paying attention to web-based hardware publications, some of the more popular sites used ‘ctcm’, a cache benchmark from C’t magazine, as well as the ZDBOp benchmarks, some benchmarks developed by Byte magazine, and a few games. Since that time we have seen publications use everything from SPEC benchmarks to games to customized freeware programs. As the competition between AMD and Intel has intensified, and hardware reviewing has become more popular and lucrative, the focus upon benchmarks has sharpened as well. While there have always been questions and concerns about the validity of the various tools, recently the debate has reached a fever pitch, with accusations of questionable activities being thrown about. Unfortunately, the major publications seem to be setting the standard for how benchmarks are used and interpreted, and that standard seems to me to be very low.

As someone who has performed benchmarks in a professional capacity while working in the IT departments of various companies, I have some background and interest in all of this. Several years ago I wrote an article on the subject outlining some of the basic issues with regard to benchmarking (unfortunately, some of the links are no longer functioning), and followed that up with a somewhat more critical piece earlier this year. Things have, in my opinion, not gotten any better, as publications:

  • Continue to use system-level benchmarks to ‘prove’ the superiority of one component over another when multiple components have been changed in the system (such as the chipset, processor and memory).
  • Use specially compiled or modified benchmark tools, making independent verification of their results difficult, if not impossible.
  • Make grandiose claims about the superiority of one component over another when the results are generally within the margin of error, or only a small percentage above it.
  • Include ratings for categories that they have done little or nothing to actually evaluate, such as quality, reliability, stability, etc.
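The margin-of-error point deserves emphasis, because it is easy to check. Before declaring a winner, a reviewer can run each configuration several times and see whether the two sets of scores even separate once run-to-run variation is accounted for. The sketch below illustrates one simple approach (comparing rough mean ± two standard-error intervals); the scores are hypothetical, and a real analysis might use a proper statistical test instead.

```python
import statistics

def overlaps(a, b):
    """Return True if the two score sets' rough confidence intervals
    (mean +/- 2 standard errors) overlap, i.e. the difference between
    them may be nothing more than run-to-run noise."""
    def interval(scores):
        m = statistics.mean(scores)
        se = statistics.stdev(scores) / len(scores) ** 0.5
        return m - 2 * se, m + 2 * se
    lo_a, hi_a = interval(a)
    lo_b, hi_b = interval(b)
    return lo_a <= hi_b and lo_b <= hi_a

# Hypothetical scores from five repeated runs on two components.
component_a = [101.2, 99.8, 100.5, 100.9, 99.6]
component_b = [102.0, 100.7, 101.4, 101.9, 100.3]

if overlaps(component_a, component_b):
    print("Difference is within the margin of error; no clear winner.")
else:
    print("Difference appears significant.")
```

With the sample numbers above, component B’s average is about 0.9% higher, yet the intervals overlap: exactly the kind of result that gets trumpeted as a victory when it should not be.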

A point that seems to have been lost on many is that benchmarks are simply an analysis tool, and most PC benchmarks are very general estimates of performance at best. I am reminded of a situation that occurred when I worked for (coincidentally enough) a large Japanese power tool manufacturer. Our processing was primarily batch oriented, where tens of thousands of transactions entered into the online systems during the day were run against the various databases, spitting out packing slips, invoices, inventory orders, reports, etc. These batch jobs generated numerous large files that had to be sorted in various ways, so the sort routines were amongst the most heavily used in the shop. Our two candidates were IBM’s DFSORT and a product called SyncSort. In order to test both, I captured a large sample of files from several days’ worth of processing and ran them through both packages exactly as they would be run during normal processing – same sort JCL and sort orders. SyncSort came out the winner by a large enough margin that we decided to go with that product. Our IBM representative was very disappointed, and offered to bring in a set of data to prove that DFSORT was faster – obviously failing to understand that what was important was how the product performed with our data. This is the ideal situation for benchmarking – a specific application being used with the specific set of data that the end user actually uses.
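The methodology in the sort comparison above generalizes: capture the real workload once, then run each candidate against identical copies of it under identical conditions. The harness below is a minimal sketch of that pattern; the datasets and the two sort routines being compared are stand-ins, not the actual DFSORT/SyncSort packages from the story.

```python
import random
import time

def benchmark(sort_fn, datasets, runs=3):
    """Time a sort routine against captured real-world datasets,
    the workload that actually matters to this shop.  Each run gets
    fresh copies of the data; the best total time is reported."""
    best = float("inf")
    for _ in range(runs):
        copies = [list(d) for d in datasets]  # identical input each run
        start = time.perf_counter()
        for data in copies:
            sort_fn(data)
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical stand-in for files captured from several days of processing.
captured = [[random.random() for _ in range(5000)] for _ in range(10)]

t_a = benchmark(list.sort, captured)            # candidate package A
t_b = benchmark(lambda d: d.sort(reverse=True), captured)  # candidate B
print(f"candidate A: {t_a:.4f}s, candidate B: {t_b:.4f}s")
```

The key design choice is that both candidates see byte-for-byte identical inputs, exactly as they would occur in production, rather than whatever data a vendor would prefer to bring in.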

It should be obvious that the nature of PC benchmarking precludes the testing of a specific suite of applications using the customer’s own data. The cost to do so would be prohibitively high, though for large companies this methodology is still used (and paid for). Therefore, the PC benchmarks that have been developed are an attempt to approximate what the average user will see. Since the data and applications are usually based upon what most people run, by definition the tests do not apply to any specific individual’s situation. More importantly, it is probably safe to say that they only apply to any reasonable degree to about 60% or 70% of the user community. For everyone else they are no more than very rough indicators of performance. Those who attempt to make the numbers more meaningful are either under a great misconception, or are intentionally trying to mislead. Ironically, the PC benchmarks that come closest to the ideal are games, because each benchmark tests only one specific game and the resulting score can be directly interpreted and easily understood. However, games constitute only a small percentage of what PCs are generally used for, particularly in business where poor performance can actually cost money.

Another situation I recall is when IBM announced their RS6000 line of systems. IBM claimed that this architecture could actually operate on 5 instructions simultaneously under certain circumstances. On a synthetic benchmark that ran a single task processing multiple transactions, the RS6000 was on the order of three to five times faster than its competition, the HP 9000. However, independent tests showed it was actually no faster than the HP 9000 when running real-world applications in a multi-tasking environment. The reason was that the overhead required to switch to a new task, and then switch back, ate up all of the time saved within the individual tasks. This illustrates the danger of using synthetic benchmarks to ‘prove’ the superiority of one design over another, yet this is done fairly frequently by various publications and marketing groups.
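The arithmetic behind that disappearing speedup is worth making explicit. A rough model (with hypothetical numbers, not IBM's actual figures) shows how fixed per-switch overhead can cancel a large per-task gain:

```python
def effective_speedup(task_time, raw_speedup, switch_overhead, tasks):
    """Rough model of multi-tasking throughput: each task's compute time
    shrinks by raw_speedup, but every task switch adds a fixed overhead
    that a single-task synthetic benchmark never pays."""
    baseline = task_time * tasks
    improved = (task_time / raw_speedup + switch_overhead) * tasks
    return baseline / improved

# Hypothetical numbers: each task runs 3x faster in isolation, but
# switching costs exactly the two time units saved per task.
print(effective_speedup(task_time=3.0, raw_speedup=3.0,
                        switch_overhead=2.0, tasks=1000))  # prints 1.0
```

Under these assumed numbers the system delivers no net gain at all, despite being three times faster on the synthetic single-task test, which is precisely the pattern the independent RS6000 tests revealed.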

One issue that has made headlines recently is the suggestion of bias and corruption within the BAPCo organization favoring Intel. Van Smith has been leading the charge on this one for a while, amidst counter-accusations of bias, corruption and zealotry. While there appear to be some valid reasons to investigate this, some of the evidence being bandied about is purely circumstantial, such as the location of BAPCo headquarters and suspicion by individuals within various organizations. It seems to me that if we are trying to determine, as scientifically as possible, whether a given benchmark accurately reflects real-world performance, then we should only use techniques and evidence that will allow us to make that determination, and forego the ad hominem attacks on companies or individuals.

This also applies to the arguments rebutting the accusations. Unsubstantiated claims of bias or corruption from either side do not constitute evidence of what the benchmarks are measuring, and serve only to obfuscate the real issues. These claims should only be reinforcement that we need to be as careful and objective as possible in any investigation. Evidence presented by anyone, biased or not, should be judged based upon its own merit, not the perceived objectivity (or lack of) by the accuser.

So, political issues aside, there is a very positive aspect to all of this hubbub, which is that the spotlight is now being focused upon the benchmarks. It has been my intent for quite some time now to perform evaluations of the benchmarks themselves to determine, as best as possible, what kind of conclusions we can reasonably make based on the test results. It seems to me that in order to use a tool effectively, one must understand what that tool is designed for. I can certainly drive in a screw with a hammer, but there are much better tools to use for that purpose, and the same can be said for diagnostic and analysis tools.

Van has suggested a Comprehensive Open Source Benchmark Initiative (COSBI) to create open source benchmarks that are ‘free’ from such influence. Though it seems to be an extremely difficult project, I think it deserves to be looked at and pursued for whatever benefits it might provide. Certainly something must be done to resolve the issues being raised. After all, millions of dollars of consumer and corporate money are potentially at stake, since the results of these benchmarks are used to make numerous selling and buying decisions.

While rumor and innuendo are certainly fun, they do little to help us gain understanding of the tools themselves – and this publication, at least, would like to provide the means for vendors and users to think for themselves rather than be led around by their emotions and ignorance. I know that other publications have similar desires, but thus far it seems that either the motivation or the understanding necessary to act on them has been lacking. Recent events suggest that now is the time, and it is none too soon, in my opinion.
