Defining Meaningful Benchmarks
To draw meaningful conclusions from a benchmark test, several ‘rules’ need to be followed. A number of web sites provide information on what constitutes good benchmarking, which benchmarks are available and how useful they are, and even how to design your own benchmark. Links to some of these sites are provided at the end of this article.
To illustrate exactly how improper benchmark testing can lead to an incorrect conclusion, Robert Collins of x86.org wrote an article in which he ‘proved’ that a Pentium 166 MMX system outperformed a Pentium II 300 system, simply by carefully choosing the video card and hard drives for each system.
The first rule is to have a clear objective (what is the purpose of the test?) and an understanding of which component or system is being tested. The objective may be to determine which CPU is better for game play, which hard drive is better for database performance, or some other specific purpose. It is necessary to recognize that different applications use the system resources differently, and therefore will be affected differently by various changes to the system.
Traditionally, computers are broken down into three main elements: the central processor, the memory subsystem and the I/O subsystem. With the recent trend towards graphics-oriented designs, it is not unreasonable to break things down a bit further and include a graphics subsystem. Knowing which of these elements your application uses is critical in evaluating whether a particular benchmark is applicable to your usage of the computer. Speeding up an element that is not the bottleneck will provide little or no improvement in throughput.
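To put a rough number on that last point, Amdahl’s law (a standard result, not something cited in this article) says the overall speedup is limited by the fraction of time the improved element is actually in use. A minimal sketch with made-up numbers:

```python
def overall_speedup(fraction_improved, factor):
    """Amdahl's law: overall speedup when only part of the run time is accelerated."""
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / factor)

# Hypothetical workload that spends only 20% of its time in the CPU:
print(overall_speedup(0.20, 2.0))   # doubling CPU speed -> ~1.11x overall
print(overall_speedup(0.80, 2.0))   # doubling the real bottleneck -> ~1.67x overall
```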
For example, AIM Technology has some articles on their web site that explain some of the performance issues behind benchmarks. One of these illustrates how a database application requires very high I/O bandwidth, while a spreadsheet or word processing program utilizes the CPU more heavily. In a series of charts, they show that increasing the CPU speed provides no benefit for the database applications, while it improves ‘office application’ performance greatly. The same discussion shows that adding additional hard drives provides absolutely no benefit to office programs, yet the database applications get a very large performance boost.
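The same effect can be modeled with a back-of-the-envelope calculation (the timings below are invented for illustration, not AIM’s data): split each workload into CPU time and I/O time, then see what a faster processor or an extra drive actually buys.

```python
# Invented CPU/IO time splits for two workload types (seconds per task).
workloads = {
    "database (I/O-bound)": {"cpu": 1.0, "io": 9.0},
    "office (CPU-bound)":   {"cpu": 8.0, "io": 1.0},
}

def run_time(w, cpu_speedup=1.0, io_speedup=1.0):
    return w["cpu"] / cpu_speedup + w["io"] / io_speedup

for name, w in workloads.items():
    print(f"{name:22s} baseline {run_time(w):4.1f}s | "
          f"2x CPU {run_time(w, cpu_speedup=2.0):4.1f}s | "
          f"2x disk {run_time(w, io_speedup=2.0):4.1f}s")
# The database job barely notices the faster CPU but loves the extra disk;
# the office job shows exactly the opposite behavior.
```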
The second rule is to understand what the benchmark actually tests, and how the final score is derived. In order to decide which benchmark serves our needs, we need to look at all of the available options, research what tests or applications they use, determine what is not being tested and also what assumptions were made by the designers. We also need to take our intended audience into consideration and limit our tests to those that are most applicable.
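As a generic illustration of how a final score might be derived (this is not the actual formula used by WinBench, Winstone, or any other suite named here), many benchmarks normalize each subtest against a reference machine and then roll the ratios into a single number, often with a geometric mean:

```python
from math import prod

# Hypothetical subtest times in seconds (lower is better); the reference
# machine and the subtests are invented for illustration.
reference = {"word processing": 40.0, "spreadsheet": 30.0, "database": 60.0}
candidate = {"word processing": 32.0, "spreadsheet": 24.0, "database": 75.0}

# Normalize each subtest against the reference: >1.0 means faster.
ratios = {test: reference[test] / candidate[test] for test in reference}

# The single published "score" is often a (sometimes weighted) geometric mean.
score = prod(ratios.values()) ** (1 / len(ratios))

print(ratios)           # per-test detail: wins on office tests, loses on database
print(round(score, 2))  # the one number that hides that detail
```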
There are two main types of benchmarks: Component level and System level. Component-level benchmarks (also called ‘synthetic’ benchmarks) attempt to isolate one specific component (e.g., CPU, motherboard, video card, or hard drive) and test all of its various features in order to evaluate its speed and efficiency. While this may be useful for academic purposes, the relevance to real-world operation is questionable; however, these are currently the best way of isolating (as much as possible) individual components for direct comparison. System-level benchmarks are used to measure overall throughput for a given system. In this case, real applications are generally used with a series of keystrokes and actions that are supposed to emulate a real user (called Application benchmarks).
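As a sketch of what a component-level (‘synthetic’) test boils down to, the fragment below times two artificial kernels, one integer-heavy and one memory-heavy. Real suites such as WinBench are far more sophisticated, so treat this purely as an illustration of the approach:

```python
import time

def best_of(fn, runs=3):
    """Report the best wall-clock time over several runs to reduce noise."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def integer_kernel(n=1_000_000):
    total = 0
    for i in range(n):
        total += (i * 3) // 7       # arbitrary integer arithmetic
    return total

def memory_kernel(size=8_000_000):
    return bytes(bytearray(size))   # allocate a buffer, then copy it

print(f"integer kernel: {best_of(integer_kernel):.3f}s")
print(f"memory kernel:  {best_of(memory_kernel):.3f}s")
```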
Synthetic benchmarks require more knowledge and analysis in order to come to meaningful conclusions, because they are often impacted by other system elements. For example, CPUMark32 (a WinBench component test) shows that a Pentium II is much faster than a Celeron of the same speed. The apparent reason for this is that the test was designed before the Celeron was available, and its working set overflows the Celeron’s smaller (128KB) L2 cache, giving erroneous results! CPUMark99 was quickly released to correct this problem. Without this understanding, it would be easy for someone to assume from the CPUMark32 results that the Celeron core is inferior to the Pentium II, when they are actually identical.
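To see why overflowing a smaller cache skews a ‘CPU’ score, consider a simple access-time model (the cycle counts and hit rates below are hypothetical, chosen only to illustrate the effect): once the test’s working set no longer fits in the L2 cache, more accesses fall through to main memory and the measured throughput drops even though the processor core is unchanged.

```python
# Hypothetical latencies (in CPU cycles) and hit rates, for illustration only;
# the real numbers depend on the specific parts and clock speeds.
L2_HIT_CYCLES = 5
MEMORY_CYCLES = 60

def avg_access_cycles(l2_hit_rate):
    return l2_hit_rate * L2_HIT_CYCLES + (1 - l2_hit_rate) * MEMORY_CYCLES

fits   = avg_access_cycles(0.98)   # working set fits in a large L2:  ~6.1 cycles
spills = avg_access_cycles(0.75)   # working set spills to RAM:      ~18.8 cycles

print(f"fits in L2:    {fits:.1f} cycles/access")
print(f"spills to RAM: {spills:.1f} cycles/access "
      f"({spills / fits:.1f}x slower, with an identical core)")
```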
System-level application benchmarks are simpler to understand, and are therefore most often used by hardware review and benchmark comparison sites (usually Winstone). What must be recognized is that this type of benchmark is almost useless for measuring the differences between individual components, unless very strict controls are in place and all hardware/software differences can be explicitly identified and quantified. Too many times, reviewers (even those who should know better) will compare a Socket 7 system and a Slot 1 system with a system-level benchmark, then proclaim that one processor is faster than the other based upon the results. It should be fairly obvious that many factors besides the processors will impact the system-level results.
The third rule is to avoid reporting only the calculated average when multiple tests are performed, and to attempt to use as many different benchmarks as possible, unless you can show that the average score from a specific benchmark is all that is necessary to achieve your objective. If only an ‘average’ score is given, we cannot really know whether the system or component was better or worse than another in any specific application area. This may be important to individuals who don’t use every type of office application, for example. If only some of the available tests were run, it is important to identify why those particular tests were chosen. It is possible that the reviewer believes a particular test is irrelevant due to a misunderstanding of what is being tested, and readers would be deprived of the opportunity to evaluate the validity of the conclusions.
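A quick numeric example of why the average alone is not enough (the scores are invented): two systems can post the same overall average while differing sharply on the individual tests a given user actually cares about.

```python
# Invented per-application scores (higher is better).
system_a = {"word processing": 120, "spreadsheet": 120, "database": 60}
system_b = {"word processing": 90,  "spreadsheet": 90,  "database": 120}

def average(scores):
    return sum(scores.values()) / len(scores)

print(average(system_a), average(system_b))  # 100.0 vs 100.0 -- a dead heat?
# Yet a database user would find System B twice as fast as System A,
# something the single averaged number completely hides.
```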
An extension of this rule would be to avoid making broad generalizations based upon the benchmark results. For example, Quake is used extensively to test 3D gaming performance. While this benchmark is obviously perfect for evaluating a system’s ability to run Quake, as well as other games using the Quake engine, it is not as good for evaluating performance on other game engines. Many also point to Quake results to ‘prove’ that the Intel FPU is superior in all ways to those implemented by AMD and Cyrix, even though there are circumstances where the AMD and Cyrix FPUs are superior. Making claims that are not justified by the available information calls the credibility of the entire evaluation into question.