A ‘Real World’ Example
To illustrate the problems in running a proper benchmark test, let us assume that we want to compare two processors to determine which one is ‘faster’. We want a valid test and conclusion, so we need to follow the rules given in the last section: identify our objective, choose which benchmarks will provide the information we are seeking, report the results of all tests within each benchmark, analyze the results and present a conclusion. The components being tested in this hypothetical comparison will be two processors: an AMD K6-2 400 and a Pentium II 400.
Our objective will be to compare the performance of these two processors in office applications, which involves mostly integer work (word processing, light spreadsheet use), some floating point (spreadsheet calculations), some database searching and 2D graphics (graphs, charts, etc.).
Now we need to investigate what benchmarks are available, which applications each benchmark uses, and how relevant they are to our objective. Two common application suite benchmarks are Ziff-Davis’ Winstone 98 and BAPCo SysMark32. For this test, we decide upon the Winstone 98 tests because they are free and more commonly used. Ziff-Davis publishes the list of applications in its suite, so we can determine whether these will suffice for our purposes; as it turns out, they have chosen the most common office applications.
We can also determine what the basic design assumptions were by reading Ziff-Davis’ own description at http://www8.zdnet.com/pcmag/pclabs/bench/labnotes/notew98b.htm. Note that the weighting assigned to each application may not reflect its importance to any given user. For this reason, we also need to determine whether the benchmark reports each application’s score, groups the applications and presents an average, or simply reports a single average for all tests. In this case, applications are grouped by function (word processing, database, spreadsheet, etc.) and scores are averaged within each group. These group scores are then averaged to produce a final score.
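The grouping-and-averaging scheme described above can be sketched in a few lines of Python. The application groups and score values here are invented for illustration; they are not Winstone 98’s actual tests or weights:

```python
# Sketch of Winstone-style scoring: raw scores are averaged within each
# application group, then the group averages are averaged into a single
# final score. All numbers below are hypothetical.

def group_average(scores):
    """Average the raw scores within one application group."""
    return sum(scores) / len(scores)

# Hypothetical raw scores, grouped by application type.
raw_scores = {
    "word_processing": [21.0, 23.5],
    "spreadsheet":     [18.2, 19.8],
    "database":        [15.5],
}

group_scores = {name: group_average(s) for name, s in raw_scores.items()}
final_score = sum(group_scores.values()) / len(group_scores)

print(group_scores)  # per-group averages
print(final_score)   # the single overall figure the benchmark reports
```

Note that the final score treats every group equally, regardless of how many tests each group contains or how important that application type is to a particular user.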
To be as accurate as possible, we should use the group scores rather than the overall average score. This will allow us to better analyze the relative performance of each processor depending upon how CPU intensive, memory intensive, graphics intensive or I/O intensive each type of application is. This is important because a very high score in one application group could skew the average result, making our conclusion about the performance of each CPU in specific application areas potentially incorrect.
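A small numeric example makes the skew concrete. The two CPUs and all score values below are invented; the point is only how one standout group can flip the overall average:

```python
# Two hypothetical CPUs: CPU B trails in the office groups but posts
# one very high graphics score. All numbers are made up.

cpu_a = {"word_processing": 20.0, "spreadsheet": 20.0, "graphics": 20.0}
cpu_b = {"word_processing": 18.0, "spreadsheet": 18.0, "graphics": 32.0}

overall_a = sum(cpu_a.values()) / len(cpu_a)  # 20.0
overall_b = sum(cpu_b.values()) / len(cpu_b)  # ~22.7 -- looks "faster"

# Yet for an office user who rarely stresses graphics, CPU B is slower
# in every group that matters -- visible only in the group scores.
print(overall_a, overall_b)
print(cpu_b["word_processing"] < cpu_a["word_processing"])  # True
```

Judged by the overall average alone, CPU B wins; judged by the word processing and spreadsheet groups, it loses both.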
Now we have identified what our objective is, and what benchmarks we wish to use. At this point we need to figure out how we will isolate each processor, which means using exactly the same components for each test, except for the processors. Changing any other hardware will mean that we need to identify the exact effect of that hardware change on our results. Failure to do so will potentially cause us to once again come to a wrong conclusion (i.e. we might think processor A is faster, but in reality the motherboards had differences that caused the results to be skewed).
In this situation, it turns out there is no way to actually make a direct processor comparison with a system-level benchmark. By simply changing the motherboard, so many other elements have changed that the test can only give you an idea of the difference between systems – which is exactly what a system-level benchmark is intended to measure. The reason is that each chipset has different capabilities and features, such as I/O bandwidth, memory bandwidth, etc. In addition, the location and speed of cache, the bus architecture and other architectural differences make a direct comparison impossible in any real sense.
The only truly scientific way of comparing the two processors in this scenario is to note exactly what the differences in the systems are – including cache speed, memory timings, I/O bandwidth, graphics bandwidth, graphics drivers, chipset drivers and any other differences that can be identified. It is not sufficient to simply mention that they are different. In order to be able to draw any realistic conclusion, one must know what those differences mean in terms of overall performance impact. In many cases, of course, this is not easily measured, and it may actually be impossible to get this information at all. As a result, any conclusions that are drawn from the limited information available about the relative performance of the two processors are flawed due to the unknown impact of these other components.
In our example, there are actually only two possible ways to compare performance. The first is to use a component-level benchmark, which will not necessarily show us the real-world differences between the two processors. The other is to simply acknowledge that two different systems are being compared, not just two processors, so no direct comparison of the processors is possible except in a very general sense. If AMD were to make a Slot 1 K6-2, or Intel were to decide to make a Socket 7 Pentium II, we might find the results very different from what we see today. Though we might be able to recommend a particular platform (Slot 1 with Pentium II vs. Socket 7 with K6-2), we cannot truly claim one processor is superior to the other.