What we are going to look at is the influence that the graphics (video) card has on the Winstone 2001 test suite. We will look at scores generated by eTestingLabs’ Content Creation Winstone 2001 v1.02 and Business Winstone 2001 v1.02. The two video cards used are based on ST Microelectronics’ Kyro 2 chip and NVIDIA’s GeForce GTS2 Pro. The video card is the only difference between the two configurations. The detailed test system specifications are here. First up, I present the scores as decreed by eTestingLabs.
Which is the winner? A convincing win for the GTS2 Pro, one would think. Since nothing has changed except the video card, and the gap looks to be greater than eTestingLabs’ margin of error of 3% (as many reviewers understand the benchmark), this looks to be “proof” that the GTS2 Pro is faster than the Kyro 2 in these sorts of applications.
Now to the nitty-gritty. Consider the following graph. Here is a composite of five full runs of Content Creation Winstone 2001. A “run” of CC 2001 is five iterations. The first is discarded, and the highest of the remaining four is deemed to be the “score”.
The PC in question was exactly the same for every run. So which is the valid run? If we had taken the first run, the system would have scored 70.9. If the second were chosen, 75.0. These two results differ by 5.8%, which falls outside the eTestingLabs framework. Aside: the eTestingLabs documentation deems any variation greater than 3% to reflect an invalid test. So what happens now? Do I discard the 25 results and start another series of tests? This, to me, is an inherent weakness of the eTestingLabs methodology. Without performing multiple runs, you won’t know whether you have a valid one. And if you start down that path, why not keep going and just average all the scores?
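The margin-of-error check above can be sketched in a few lines of Python. The two scores (70.9 and 75.0) and the 3% threshold come from the text; the helper function `percent_difference` is my own naming, not part of any benchmark tooling.

```python
# Sketch of the 3% margin-of-error check described above.
def percent_difference(a: float, b: float) -> float:
    """Difference between two scores as a percentage of the lower one."""
    low, high = sorted((a, b))
    return (high - low) / low * 100

run1, run2 = 70.9, 75.0                    # the two runs from the text
delta = percent_difference(run1, run2)
print(f"Runs differ by {delta:.1f}%")      # ~5.8%
print("Within the 3% margin?", delta <= 3.0)
```

By eTestingLabs’ own rule, a 5.8% spread between two runs of an identical system should flag the test as invalid.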
So I discard the results of the first two runs and try again. Notice how the last three results are much closer together. In fact, if you average the first two runs (72.95), that figure is much closer to these three.
So why not average the five scores? In fact, why not average all 25 results (five runs of five iterations)? The average of the five best scores gives a figure of 73.0. The average of all 25 gives 71.7.
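The mechanics of the two averages can be sketched as follows. The article’s raw iteration data are not reproduced here, so the numbers below are invented purely to show the two calculations; only the scoring rule (drop the first iteration, report the best of the rest) comes from the text.

```python
from statistics import mean

# Hypothetical iteration data: five runs of five iterations each.
runs = [
    [70.0, 70.9, 70.4, 70.7, 70.6],
    [73.8, 74.5, 75.0, 74.7, 74.2],
    [71.5, 72.8, 73.0, 72.6, 72.9],
    [71.0, 72.5, 72.9, 72.7, 72.4],
    [71.2, 72.6, 73.1, 72.8, 72.5],
]

# Winstone's rule: drop the first iteration, report the best of the rest.
scores = [max(run[1:]) for run in runs]

best_avg = mean(scores)                            # average of the reported "scores"
all_avg = mean(v for run in runs for v in run)     # average of all 25 iterations
```

Because the reported “score” is always the best of a run, the average of the five scores will sit above the average of all the raw iterations, just as 73.0 sits above 71.7 in the article’s data.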
Let’s look at the data another way:
The blue line is the average of the 25 scores. The green line is the average of the 5 best (Winstone reported “scores”). Look hard at the data points. If we accept the premise that a PC’s performance will oscillate about a mean, which is the “best” representation of a performance “score”?
“But,” I hear you cry, “it doesn’t matter, because the same conditions affect everyone, so the relative positions should remain the same. We can still look at the graphs the way they are now and get a feel for relative performance differences. Basically, Campbell, you are talking out of your fundament.”
Not so fast, sunshine. The above would only be true if the performance delta (how much the result varies) were the same for each CPU, system, video card or whatever you are testing. The variance may be the same, and it may not. And, in most cases, we (the readers) are not told. Unless you run the tests multiple times, you won’t know. And if you have run the tests multiple times, then you may as well use the multiple results. Aside: is this a self-fulfilling prophecy?
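To see why unequal variance matters, here is a small sketch with invented numbers (these are not measurements of any real card): two devices with the same mean performance but very different run-to-run spread. Pick a single run and the ranking can go either way.

```python
from statistics import mean, stdev

# Hypothetical scores, invented purely to illustrate the point:
# same mean performance, very different run-to-run variance.
card_a = [70.0, 75.0, 70.5, 74.5, 70.2]   # noisy card: ~5-point swings
card_b = [72.0, 72.2, 71.9, 72.1, 72.0]   # stable card

# A single run can rank them either way:
print(card_a[1] > card_b[1])   # run 2: "A wins"
print(card_a[0] > card_b[0])   # run 1: "B wins"

# Yet the means are identical; only the spread differs.
print(round(mean(card_a), 2), round(mean(card_b), 2))
print(stdev(card_a) > stdev(card_b))   # True: A varies far more
```

A single reported “score” from each card could crown either one the winner; only multiple runs reveal that they perform the same on average.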
Let’s look again at the two results.
The averages for both cards are there (71.7 and 68.5), and in almost every case (all but one, in fact) the GTS2 beats the Kyro 2. Looks compelling so far.
If we are going to use the mean as a gauge, how does that help us? Glad you asked. An entire branch of mathematics has sprung up designed to answer just such a question – statistics. Yes, lies, damned lies and…
It is also fortunate that I have decided to use the average score, as this allows statistical analysis of the results: the scores generated should be approximately normally distributed, which is a requirement for the standard statistical techniques.