A Look at Benchmark Reporting
A strange thing happened after the publication of my KT7A vs KA266 article…I had to think :0
I got taken to task by “Doug” in the RealworldTech forums about the different Winstone scores, and about attributing the difference to the hard drive controller software (the IDE/ATAPI drivers).
“So,” I thought, “I’ll have to go benchmarking again.” Which I did. But don’t rush off and compare the old numbers with the new ones I present here, because they use two different methodologies. The “old” method I used was the Ziff-Davis recommended way to run the Winstone tests:
- Run the test five times.
- Discard the first run.
- Report the highest of the next four results as the “score” – but only if the five results are within 3% of each other. If the variation is greater than 3%, re-run the tests.
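That procedure can be sketched in a few lines of Python (my own sketch, not Ziff-Davis code; I’m reading “within 3% of each other” as the spread between the highest and lowest of the five runs, relative to the lowest):

```python
def zd_score(results):
    """results: the five Winstone scores, in the order they were run."""
    assert len(results) == 5
    # check the 3% variation across all five runs
    spread = (max(results) - min(results)) / min(results)
    if spread > 0.03:
        return None  # variation too high: re-run the tests
    kept = results[1:]  # discard the first run
    return max(kept)    # report the highest of the remaining four

print(zd_score([62.8, 63.5, 63.1, 63.9, 63.4]))  # -> 63.9
print(zd_score([62.0, 60.0, 63.5, 67.7, 63.4]))  # -> None (re-run)
```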
Fine, but on single-drive setups it’s hard to stay within that 3% margin – causing the test to be re-run many times. And when re-running the tests many times, I found that the results themselves would vary by more than 3%. For example, the exact same setup (Athlon 1.33GHz/75GXP) has returned valid Content Creation 2001 Winstone scores of 63.5 and 67.7 (and a few others in between).
Which one is valid? Well, they both are. Which one should I present in the results? Either – depending on whether I wanted to show AMD in a good or bad light. So after much thinking, it’s time for a change.
My new method of testing is thus:
- Run the test seven times.
- Discard the high and low scores.
- Calculate the percentage difference across the remaining scores – re-run the tests if they differ by more than three percent.
- Average the remainder.
Now this result is usually an average of five scores – but not necessarily. What if there were two low scores? I would discard both, plus the high score, and average the remaining four. If there were two low and two high, I would discard those four and average the remaining three. If the high and low scores were such that fewer than three samples remained after the discards, I would re-run the tests.
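The new procedure can be sketched the same way (again my own Python, with the “two lows” case handled by discarding every run that ties the high or the low value):

```python
def new_score(results, tolerance=0.03):
    """results: the seven benchmark scores from repeated runs."""
    assert len(results) == 7
    high, low = max(results), min(results)
    # discard every occurrence of the high and the low score
    kept = [r for r in results if r not in (high, low)]
    if len(kept) < 3:
        return None  # too few samples left after discards: re-run
    spread = (max(kept) - min(kept)) / min(kept)
    if spread > tolerance:
        return None  # remaining scores still vary too much: re-run
    return sum(kept) / len(kept)

# one high outlier and one low outlier discarded, five scores averaged
print(new_score([63.4, 63.8, 63.5, 62.9, 67.7, 63.6, 63.4]))  # -> 63.54
# remaining scores vary by more than 3%: re-run
print(new_score([60.0, 62.0, 63.5, 67.7, 63.4, 58.0, 70.0]))  # -> None
```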
What does this new measure provide? It is a more representative (dare I say it, more “real world”) measure of the scores obtained. Why? Because in practice, any action you measure will oscillate about a mean figure; you won’t score the high figure every time. This score will also be slightly (a point or two) lower than comparable setups in other reviews you might see, but I see it as a fairer, more representative indication of performance – and it is the same for all players. With the error margins included, you should also be able to see whether a small performance difference is meaningful or not.
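As a rough illustration of that last point (this is my own reading of an “error margin”, not a formula from the Winstone documentation): report each average as the mean plus or minus half the spread of the kept runs, and treat two setups as meaningfully different only when those ranges don’t overlap:

```python
def with_margin(kept_scores):
    """Mean of the kept runs, plus an error margin of half their spread."""
    mean = sum(kept_scores) / len(kept_scores)
    margin = (max(kept_scores) - min(kept_scores)) / 2
    return mean, margin

def meaningfully_different(a, b):
    """True only when the two score ranges do not overlap."""
    (mean_a, m_a), (mean_b, m_b) = with_margin(a), with_margin(b)
    # overlapping ranges -> the gap may just be run-to-run noise
    return abs(mean_a - mean_b) > (m_a + m_b)

# nearly identical setups: the half-point gap is within the noise
print(meaningfully_different([63.4, 63.5, 63.6], [63.5, 63.7, 63.8]))  # -> False
# a 3+ point gap with tight runs: a real difference
print(meaningfully_different([60.0, 60.2, 60.1], [63.4, 63.6, 63.5]))  # -> True
```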