But What is the Idea Behind Statistical Analysis?
For the purposes of this article, statistical theory is based on a concept called the normal distribution, or normal curve. I’m sure you have heard of it. The premise is that in any random sample, the results will cluster about some mean, or average value. As you get closer to the mean, you get more results. As you get further away from the mean, you get fewer. In the ideal case, about 95% of values are within 2 standard deviations either side of the mean. What this means is that the area underneath this “bell curve” gets smaller and smaller the further you move away from the mean (average). The further you move away from the mean, the less chance you have of seeing a result: not impossible, mind, but very unlikely. This is the problem with using the highest score as a means of representing performance. You might have a high score that is close to the mean, or you may have an outlier, a 1-in-1000 score. Without knowing how all the results panned out, though, you won’t know, and in nearly all of the reviews that I have read this small but important bit of data isn’t disclosed.
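To put rough numbers on the bell curve, here is a minimal Python sketch using the standard library's `statistics.NormalDist`. It assumes an idealised normal curve (mean 0, standard deviation 1); the helper names `share_within` and `share_beyond` are my own and purely illustrative.

```python
from statistics import NormalDist

# Idealised bell curve: mean 0, standard deviation 1 (an assumption for illustration).
curve = NormalDist(mu=0.0, sigma=1.0)

def share_within(k: float) -> float:
    """Fraction of values lying within k standard deviations of the mean."""
    return curve.cdf(k) - curve.cdf(-k)

def share_beyond(k: float) -> float:
    """Fraction of values lying more than k standard deviations above the mean."""
    return 1.0 - curve.cdf(k)

print(f"within 2 SD of the mean: {share_within(2):.4f}")  # 0.9545
print(f"more than 3.09 SD above: {share_beyond(3.09):.5f}")
```

On this ideal curve, a score a little over 3 standard deviations above the mean is roughly the 1-in-1000 outlier mentioned above, which is why a single best score says so little on its own.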
In the interest of keeping this article to a manageable size and on topic, I will skip the statistics backgrounder, as there are many texts that deal with statistics on various levels and can answer your questions better than I.
What we are interested in is the part that applies to analysing the results we are dealing with: comparing two samples. The tricky part is determining if the samples are dependent or independent. An independent sample is one in which objects from the two “populations” are unrelated. If, however, the two populations (what we are trying to measure) are related such that when an object (in our case a Winstone score) is chosen from one population (group of scores), another is chosen from the other population, we have a dependent sample.
In this particular case, where we have a system where we are changing one variable (the video card), we would have a dependent sample. Also, note that we only have 25 samples, and so we use a T distribution rather than the normal (z) distribution. This talk about distributions simply refers to the shape of the bell curve that is generated from a sample of results. The curve we are using (the T distribution) has a slightly different shape, with fatter tails, to make up for small sample sizes (< 30). Why 30? Simple. The shape of the T distribution is very close to the normal distribution when your sample has 30 or more values, so you would use the normal curve for sample sizes of 30 or more.
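To see how the T curve closes in on the normal curve as the sample grows, the one-tailed 5% cut-off can be computed for a few sample sizes. The following is a rough sketch using only the Python standard library: it integrates the textbook T density numerically (Simpson's rule) and finds the cut-off by bisection. The function names are my own, and in practice you would normally read these values from a printed table or a statistics package.

```python
import math
from statistics import NormalDist

def t_pdf(x: float, df: int) -> float:
    """Density of Student's T distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x: float, df: int, steps: int = 2000) -> float:
    """P(T <= x) for x >= 0: half the area lies below 0, the rest via Simpson's rule."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(alpha: float, df: int) -> float:
    """One-tailed critical value: the t whose upper-tail area equals alpha."""
    lo, hi = 0.0, 50.0
    for _ in range(60):  # bisection on the upper-tail area
        mid = (lo + hi) / 2
        if 1 - t_cdf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for df in (5, 24, 30, 100):
    print(f"df = {df:3d}: t cut-off = {t_critical(0.05, df):.3f}")
print(f"normal (z) cut-off = {NormalDist().inv_cdf(0.95):.3f}")
```

The cut-off shrinks towards the normal value as the degrees of freedom rise, which is the sense in which the two curves become "very close" for larger samples.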
Resolving a question via statistical means involves generating hypotheses: a claim and an alternate. In our case, the claim (called the null hypothesis, H0) is that the performance of Kyro 2 = GeForce 2 GTS. The alternate hypothesis (Ha) is that the performance of Kyro 2 < GeForce 2 GTS. I should note that hypotheses are generally constructed in such a way that the alternate hypothesis (Ha) is the one that the researcher believes to be the correct view.
We now introduce the rejection region. The rejection region is the area under the curve in which a result would lead us to believe that the alternate hypothesis was in fact the correct one. If our test result falls within this region, then we accept the alternate hypothesis, as the chance of observing such a rare result when H0 is correct is very small.
Trouble looms, however. It is always possible to make an incorrect decision, and we can never eliminate the possibility of making such an error. What we can do is minimise the likelihood of making an incorrect decision. The table below indicates the things that can go wrong.
| Decision | H0 is correct | Ha is correct |
| --- | --- | --- |
| Reject H0 (Ha is correct) | Type I error | Correct |
| Accept H0 | Correct | Type II error |
As I have indicated before, experimenters usually structure the hypotheses in such a way that they believe the alternate hypothesis to be correct. What we must therefore do is minimise the chances of rejecting H0 when it is actually correct. By doing this we avoid touting our hypothesis as correct when it is not. But remember, there is always still a small chance that we will get it wrong.
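What we want to decide is whether the mean score obtained using the Kyro 2 card is different from the mean score using the GeForce 2 GTS video card. What we want to do is determine the difference between two means (U1 – U2). We do this by considering a population of differences that has a mean (UD). Taking each difference as the GeForce 2 GTS score minus the Kyro 2 score (the direction implied by the positive mean difference below), a slower Kyro 2 shows up as a mean difference greater than D0:
H0: UD = D0
Ha: UD > D0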
Now for some formulae. The test statistic is given by
t = (XD – D0) / (SD / √nd)
SD and XD are calculated values:
XD = ΣD / nd
SD = √( Σ(D – XD)² / (nd – 1) )
An explanation of the terms:
- nd is the number of sample differences (25)
- XD is the mean of the sample differences
- SD is the sample standard deviation of the differences
- D0 is the hypothesised mean difference (and usually zero)
- D is the difference between the two results for a single run
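The terms above can be turned into a few lines of Python. This is only a sketch of the standard paired T calculation: the function name `paired_t` is my own, and the two lists of scores are made-up values for illustration, not the Winstone results used in this article.

```python
import math

def paired_t(scores_a, scores_b, d0=0.0):
    """Return (XD, SD, t) for paired runs, with each D taken as scores_a[i] - scores_b[i]."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    nd = len(diffs)
    mean_d = sum(diffs) / nd                                           # XD
    sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (nd - 1))  # SD
    t = (mean_d - d0) / (sd_d / math.sqrt(nd))                          # test statistic
    return mean_d, sd_d, t

# Hypothetical scores for illustration only; NOT the article's data.
card_a = [42.0, 40.0, 44.0, 41.0, 43.0]
card_b = [40.0, 39.0, 41.0, 40.0, 40.0]

mean_d, sd_d, t = paired_t(card_a, card_b)
print(f"XD = {mean_d:.3f}, SD = {sd_d:.3f}, t = {t:.3f}")
# XD = 2.000, SD = 1.000, t = 4.472
```

Note that the scores are paired run by run before anything is averaged; that pairing is what makes this a dependent-sample test.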
Plugging in the numbers (table of results):
XD = 80.2/25 = 3.208
SD = 3.134
For the sake of this article I will define the rejection region using a 95% confidence level (a 5% significance level). In other words, if H0 is actually correct, there is only a 5% chance that I will wrongly reject it. The value is taken from a series of pre-calculated tables. For nd – 1 = 24 degrees of freedom (one-tailed), the value is 1.71. Therefore reject H0 if t > 1.71.
t = (3.208 – 0)/(3.134/5)
t = 3.208/0.6268
t = 5.118
As t is greater than 1.71, we reject the null hypothesis and accept the alternate: that the performance of the Kyro 2 is less than that of the GeForce 2 GTS in Content Creation Winstone 2001.
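As a quick check on the arithmetic, the calculation above can be sketched in Python. This works only from the rounded summary figures quoted in the article (80.2, 25, 3.134 and the table value 1.71). The function name `t_from_summary` is my own, and the last digit or two of the printed t depends on that rounding.

```python
import math

def t_from_summary(mean_diff: float, sd_diff: float, n: int, d0: float = 0.0) -> float:
    """Paired T statistic from summary figures: (XD - D0) / (SD / sqrt(nd))."""
    return (mean_diff - d0) / (sd_diff / math.sqrt(n))

CRITICAL = 1.71  # one-tailed table value for 24 degrees of freedom, as quoted above

t = t_from_summary(mean_diff=80.2 / 25, sd_diff=3.134, n=25)
print(f"t = {t:.3f}")
print("reject H0" if t > CRITICAL else "do not reject H0")
```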
Big deal, you say. I could have told you that by looking at the graph or even the original bar graph.
OK, clever person, take a look at this graph.
Normally this would appear in a “review” with the indication being that the Kyro 2 is beaten by the GeForce2 Pro (again). The line graph doesn’t help much.
It appears that again the Kyro is outpaced. But is it? Using the same hypotheses as above and plugging in the numbers we get t = 1.361, which is not larger than 1.71, so we do not reject H0 this time. The upshot? For Business Winstone 2001, there is not enough evidence to support a conclusion that the GeForce 2 GTS is faster.
What’s that you say? Don’t you mean that the cards are equal in performance? No, I don’t. Remember the discussion about Type I and Type II errors? Just because we have not produced evidence to support our theory, it does not automatically mean that we embrace the opposite view. To do so would run the risk of introducing a Type II error! We have to fall back upon the (rather unsatisfactory) statement of not rejecting H0, which is not the same as accepting it. What we have is a situation where we have failed to prove our hypothesis (remember, we framed the null and alternate hypotheses in such a way that we believed Ha to actually be the correct hypothesis). We then constructed our analysis in such a way as to minimise the chances of getting a Type I error (rejecting H0 when it is in fact correct). But this leads to a greater risk of producing Type II errors, so in order to minimise the chances of making this error, we are conservative in our conclusions. To form a definitive conclusion, we would have to redo our tests and analysis.
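The trade-off between the two kinds of error can be illustrated with a small simulation. This is a sketch under assumptions of my own, not a re-analysis of the benchmark data: differences are drawn from a normal curve with a standard deviation of 3.0, each simulated experiment has 25 runs, the cut-off is the one-tailed 5% table value for 24 degrees of freedom (1.711), and the "real" difference of 0.85 is a hypothetical value chosen only to show the effect.

```python
import math
import random

T_CRIT = 1.711  # one-tailed 5% table value for 24 degrees of freedom

def rejection_rate(true_diff, sd=3.0, n=25, trials=20000, seed=1):
    """Fraction of simulated experiments in which H0 (no difference) is rejected."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(trials):
        # One simulated experiment: n paired differences around the true difference.
        diffs = [rng.gauss(true_diff, sd) for _ in range(n)]
        mean_d = sum(diffs) / n
        sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
        if mean_d / (sd_d / math.sqrt(n)) > T_CRIT:
            rejected += 1
    return rejected / trials

# H0 true: every rejection is a Type I error.
type_1 = rejection_rate(true_diff=0.0)
# Ha true (hypothetical small difference): every non-rejection is a Type II error.
type_2 = 1 - rejection_rate(true_diff=0.85)

print(f"Type I rate  ~ {type_1:.3f}")
print(f"Type II rate ~ {type_2:.3f}")
```

Under these assumptions the Type I rate sits near the 5% we designed for, while a small but real difference is missed in a large share of the simulated experiments. That is why "do not reject H0" cannot be read as "the cards are equal".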
This is where statistical analysis methods come into their own. Statistical analysis can tell us if a difference in performance is significant or not within a series of marginal results. There is no guesswork, and no "voodoo" numbers. We have been as methodical as we can and, based on the results as run, the two cards cannot be separated in Business Winstone 2001. And this is in stark contrast to what some reviews would have you believe.