What is the worth of a tool if the results it gives you are misinterpreted?
This article looks at the way results from eTesting Labs Inc.'s Winstone 2001 are reported. What I am trying to do is examine the validity of the results for this test suite that are published on the web and in print and, I hope, open up a discussion on the subject. I will also present an alternative reporting schema for comment. The general approach, comments and issues raised by this article apply equally to other benchmarks.
Those not familiar with the Winstone 2001 test suite, or who would like more background material, are encouraged to read this piece by Dean Kent, entitled: Business Winstone and Content Creation Winstone, An In-Depth Look.
What started all this?
Running lots of tests. :) Seriously, many repetitions of the tests on the same hardware yielded different results. Surely, if the tests were accurate, multiple runs should produce scores that are the same, or within very narrow margins of each other. In fact, eTesting Labs guidelines state that the final scores reported across multiple runs should be within 3% of each other. Sometimes this was the case, and sometimes not. But this raises an interesting question: if we are reporting the high score as the overall result, how can we be sure we have recorded it? Might not another run produce a higher score? The simple answer is that it might, and it might not.
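As a rough illustration of the 3% guideline, the sketch below checks whether a set of repeated scores falls within 3% of each other. The scores are made-up placeholder values, not measured results, and reading "within 3%" as the gap between the highest and lowest score relative to the lowest is my assumption; the guideline as quoted here does not spell out a formula.

```python
def spread_percent(scores):
    """Gap between the highest and lowest score, as a percentage of the lowest.

    This definition of 'spread' is an assumption made for illustration.
    """
    lowest, highest = min(scores), max(scores)
    return (highest - lowest) / lowest * 100.0


def within_guideline(scores, limit_percent=3.0):
    """True if the spread of the scores does not exceed the limit."""
    return spread_percent(scores) <= limit_percent


# Hypothetical scores from five runs on the same hardware (illustrative only).
runs = [40.0, 40.5, 41.0, 39.8, 41.4]

print(f"spread: {spread_percent(runs):.2f}%")
print("within 3%:", within_guideline(runs))
```

Note that even when the check passes, it says nothing about whether the highest score seen so far is the highest the system could produce.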
So where does that leave us?
It leaves us with the high score as the result. Is this really representative of what the system in question is capable of? Surely the highest score represents a value obtained when everything is going for you. The converse is true of the lowest score (when everything is against you). But what exactly are we trying to do when benchmarking a system? We are trying to gauge performance, and this performance will fluctuate about a mean. So why isn’t the mean used?
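To make the difference between the high score and the mean concrete, here is a minimal sketch that summarises a set of runs. The scores are hypothetical placeholders rather than real Winstone results, and the choice of summary figures (high, low, mean, sample standard deviation) is mine, not something prescribed by the benchmark.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return the high, low, mean and sample standard deviation of the scores."""
    return {
        "high": max(scores),
        "low": min(scores),
        "mean": mean(scores),
        "stdev": stdev(scores),  # sample standard deviation, needs >= 2 scores
    }


# Hypothetical scores from five runs on the same hardware (illustrative only).
runs = [40.0, 40.5, 41.0, 39.8, 41.4]

summary = summarize(runs)
for name, value in summary.items():
    print(f"{name:>5}: {value:.2f}")
```

Reporting only the "high" line throws away the other three, which are exactly the figures that describe how much the system fluctuates.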
The answer to that lies in the origin of the benchmark - Ziff-Davis Media. What we have is a benchmark that has a result system that is designed to deliver a result. Consider this graph:
To the average reader (statistically illiterate, and that is not a criticism, by the way), what does it show? PC 2 is the winner, that’s what. Without knowing how the actual results panned out, that’s how such data is often (mostly!) presented, and how readers perceive it.
But is it true?
Again, without access to the full result suite, who knows? I would be inclined to think that the results, as presented, show that the two systems are about equal in performance. But how do I convey that idea to a readership? Is it even possible?
The nearest analogy I can think of is drag racing. Here we have contestants who pair off and run against each other, eliminating the loser until only one is left and a winner is declared. Realistically, any one of a dozen top contestants could win, and it just depends on their performance on the day. Computers are like that. With results like those above, I could easily run the tests again and come up with a different winner.
The results shown in Figure 1 are pretty close though, and if I (as the author) said in my analysis that the results should be interpreted as being equal, then most readers would accept that, but there would still be a nagging doubt that perhaps PC 2 was faster than PC 1. This is where readers’ perceptions get coloured.
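The drag racing point can be sketched as a small simulation. Everything in it is an assumption made for illustration: two systems with exactly the same underlying performance, run-to-run noise that follows a normal distribution, and arbitrary values for the mean, the noise and the number of runs. None of it is measured data.

```python
import random


def high_score(rng, true_mean, noise, runs):
    """Best score out of a number of noisy runs (normal noise is an assumption)."""
    return max(rng.gauss(true_mean, noise) for _ in range(runs))


def pc2_win_rate(trials=10_000, runs=5, true_mean=40.0, noise=0.5, seed=1):
    """Fraction of comparisons in which PC 2 posts the higher high score.

    Both PCs are given identical true performance, so any 'win' is pure chance.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        pc1 = high_score(rng, true_mean, noise, runs)
        pc2 = high_score(rng, true_mean, noise, runs)
        if pc2 > pc1:
            wins += 1
    return wins / trials


print(f"PC 2 'wins' in {pc2_win_rate():.1%} of simulated comparisons")
```

With two identical systems, each comes out as the "winner" in roughly half of the comparisons, which is the situation a single pair of high scores cannot distinguish from a real difference.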
The aim of the rest of this article is to examine a full set of results and see whether statistical analysis supports the proclamation of a winner. I will initially present you with a real-life result, and then we will take the data apart.