The implications of this are far-reaching. Without knowing how a print or web publication arrives at the scores for its reviews, one can only say that the scores presented as indicative of performance are, at best, optimistic; at worst, they are flat-out wrong. Many reviews don't even comment on whether part A is faster than part B, neatly sidestepping the question. If a reviewer ran a test once (and this applies equally to any benchmark) and then published the results, they may have got an accurate result. It is just as likely, though, that they did not. So the question must be asked: "Can you rely on the numbers?"
This also means a lot more work for serious reviewers. It doesn't mean running the test once with 25 runs instead of five; it means that the whole test (of five runs) must itself be repeated, at least three times and preferably five, if using the Winstone suite.
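A minimal sketch of what "repeat the whole test" looks like in practice, using invented Winstone-style scores (the numbers and session counts here are illustrative, not real results): each five-run session is reduced to its mean, and the spread across sessions is what tells you how trustworthy the headline number is.

```python
# Illustrative only: summarising repeated benchmark sessions instead of
# trusting a single run. Scores are made up for the sake of the sketch.
from statistics import mean, stdev

# Each inner list is one "session" of five runs.
sessions = [
    [41.2, 40.8, 41.5, 40.9, 41.1],
    [42.0, 41.6, 41.9, 41.4, 41.8],
    [40.5, 41.0, 40.7, 41.2, 40.9],
]

# One mean per session, then the spread across sessions.
session_means = [mean(s) for s in sessions]
overall = mean(session_means)
spread = stdev(session_means)

print(f"session means: {[round(m, 2) for m in session_means]}")
print(f"overall mean:  {overall:.2f} +/- {spread:.2f}")
```

A single session here would have reported anywhere from about 40.9 to 41.7; only the repeated sessions reveal that.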
Without knowing how much the scores varied for any benchmark, we cannot be sure whether claims of "identical" performance or "superior" performance are true. Just because one score appears higher than another, it does not follow that this translates to superior performance; the score might simply be more variable.
For what it’s worth, game-engine based tests tend to be less variable than system-type benchmarks – but this shouldn’t come as a surprise, because they usually stress one part (or relatively few parts) of a PC’s subsystems. A difference of 2 points in Winstone may be significant, but probably isn’t; the same tests run the same way on a different day may well yield a different result. A difference of 2 frames per second in a Quake 3 score, on the other hand, probably is significant, because the variability of that score is so much lower. But whether 248 or 250 frames per second is of importance to you is another story… :)
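The Winstone-versus-Quake contrast above can be sketched numerically. All four score sets below are invented, and the "gap versus twice the combined spread" rule of thumb is a simplification of a proper significance test, but it shows why the same 2-point gap is noise in one benchmark and meaningful in the other:

```python
# Hypothetical scores for two parts on two benchmarks, illustrating why
# a fixed score gap means different things depending on variability.
from statistics import mean, stdev

winstone_a = [40.1, 42.3, 39.8, 41.9, 41.0]       # noisy system benchmark
winstone_b = [42.0, 43.8, 41.5, 44.1, 42.9]       # ~2 points "faster"
quake_a    = [248.1, 248.4, 247.9, 248.2, 248.0]  # tight game-engine test
quake_b    = [250.2, 250.0, 250.4, 250.1, 250.3]  # ~2 fps faster

def gap_vs_noise(a, b):
    """Return the mean gap and the combined run-to-run spread."""
    gap = mean(b) - mean(a)
    noise = (stdev(a) ** 2 + stdev(b) ** 2) ** 0.5
    return gap, noise

for name, a, b in [("Winstone", winstone_a, winstone_b),
                   ("Quake 3", quake_a, quake_b)]:
    gap, noise = gap_vs_noise(a, b)
    verdict = "likely real" if gap > 2 * noise else "within the noise"
    print(f"{name}: gap {gap:.2f}, spread {noise:.2f} -> {verdict}")
```

With these numbers the ~2-point Winstone gap falls inside the noise while the ~2 fps Quake 3 gap does not; a proper two-sample t-test (as in the statistics texts referenced in this article) would formalise the same comparison.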
Keep all this in perspective, though. If you are basing your purchasing decisions on such small margins, I say you should think again…
Then there is the matter of eTestingLabs’ methodology of discarding runs that do not meet the 3% rule. The question I am posing is (assuming the test was run correctly in the first place): why would this make a run invalid? Surely things could go “just right” for a tester on a particular day, producing a score that is rarely likely to be equalled. Is that a reflection of the performance that a person basing a purchasing decision, in whole or in part, on the review will likely see? In my opinion it is not; it is in fact a distortion of the picture.
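A rough sketch of the distortion being argued here. The scores are invented, and the rule is modelled simply as "discard any session whose runs spread by more than 3%" (the exact definition eTestingLabs uses may differ); the point is that throwing away a legitimately noisy session shifts the reported mean:

```python
# Illustration of discarding sessions under a 3% spread rule.
# Scores are made up; the rule's exact definition is an assumption.
from statistics import mean

sessions = [
    [41.0, 41.2, 40.9, 41.1, 41.0],  # tight spread: kept
    [43.5, 41.0, 42.8, 41.3, 43.1],  # spread > 3%: discarded
    [40.8, 41.0, 40.9, 41.1, 40.7],  # tight spread: kept
]

def within_3pct(runs):
    """True if max and min differ by no more than 3% of the minimum."""
    return (max(runs) - min(runs)) <= 0.03 * min(runs)

kept = [s for s in sessions if within_3pct(s)]
all_mean = mean(mean(s) for s in sessions)
kept_mean = mean(mean(s) for s in kept)
print(f"mean of all sessions:  {all_mean:.2f}")
print(f"mean after discarding: {kept_mean:.2f}")
```

In this toy case the discarded session happens to contain the highest scores, so the published number understates the average result; with different data the bias could just as easily run the other way, which is the author's objection.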
Where this works, and where this doesn’t…
Analysis of data by this technique lends itself readily to the testing of individual components, because we are able to isolate one and only one change, and can thereby attribute any change in scores to the changed component. Where the technique presented here falls down is when there are multiple component changes – but for that there are other tests, and other articles.
Mendenhall, William and Beaver, Robert J., “A Course in Business Statistics, Third Edition”, PWS-Kent Publishing Company, ISBN 0-534-92989-3, 1992.
Freund, John E., “Modern Elementary Statistics, Sixth Edition”, Prentice-Hall, ISBN 0-13-593559-8, 1984.
Dietrich, Frank H., II and Kearns, Thomas J., “Basic Statistics: An Inferential Approach”, Dellen Publishing Company, ISBN 0-02-328801-9, 1989.