|

Sample standard deviation
The sample mean is a random variable whose variability is related to the population standard deviation and sample size. Knowing the size of this variability allows us to measure the precision of the sample mean as an estimate of the population mean. However, in practice we rarely know the population standard deviation.
We can solve this problem by estimating this unknown parameter using the sample standard deviation, defined by this equation. But where does this definition come from?
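The equation referred to here is not reproduced in this text. Assuming the usual notation, with x_1, ..., x_n for the n observations, x-bar for the sample mean, and s for the sample standard deviation (these symbols are assumptions, not taken from the original display), the standard definition can be written as:

```latex
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2}
```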
Firstly, the aim is to measure variability about the sample mean. Suppose we take a sample of 5 female UQ students and measure their heights, recording 172, 157, 174, 174, and 165 cm. This gives a sample mean of 168.4 cm. A simple measure would be to add up all the differences between the observations and the sample mean: 3.6, -11.4, 5.6, 5.6, and -3.4. If there were a lot of variability then these numbers would tend to be bigger in size, and so we'd expect the total to be bigger, making it a good measure. However, when you add these differences you get 0, no matter what the sample values are, because the positive and negative differences always cancel.
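The cancelling can be checked directly. The following is a minimal sketch using the five heights from the example; the variable names are chosen here for illustration only.

```python
# Heights (cm) of the five students from the example.
heights = [172, 157, 174, 174, 165]

sample_mean = sum(heights) / len(heights)

# Signed differences between each observation and the sample mean.
deviations = [x - sample_mean for x in heights]

print(round(sample_mean, 1))
print([round(d, 1) for d in deviations])

# Positive and negative differences cancel, so the total is 0
# (up to tiny floating-point rounding error).
total = sum(deviations)
print(abs(total) < 1e-9)
```

Swapping in any other list of numbers for `heights` should give the same result for the final check, which is why this total cannot measure spread.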
It is easy to fix this by summing the absolute values instead: 3.6, 11.4, 5.6, 5.6, and 3.4, giving 29.6. The cancelling problem is gone and so we have a good measure of spread. The problem with this idea is that if you plot the absolute value function you'll notice a very sharp corner at 0. Calculus, an important mathematical tool, works best with smooth graphs, and this corner makes life awkward.
Another way of making everything positive is to square the differences: 12.96, 129.96, 31.36, 31.36, and 11.56, giving 217.2. A graph of the square function is just a parabola, nice and smooth. You will see a lot of sums of squared deviations in statistics, such as in regression and ANOVA.
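Both ways of removing the signs can be compared on the height data. This is a small sketch; the names `sum_abs` and `sum_sq` are illustrative choices.

```python
heights = [172, 157, 174, 174, 165]
sample_mean = sum(heights) / len(heights)

# Sum of absolute differences from the sample mean.
sum_abs = sum(abs(x - sample_mean) for x in heights)

# Sum of squared differences from the sample mean.
sum_sq = sum((x - sample_mean) ** 2 for x in heights)

print(round(sum_abs, 1), round(sum_sq, 1))
```

Neither total can cancel to 0 unless every observation equals the sample mean.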
Notice, though, that if we use the sum of squared deviations to compare the variability of samples with different sample sizes, then the larger sample would usually give a bigger sum, even if the samples came from the same population. So we should average the squared deviations by dividing by n.
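To see the effect of sample size, here is a sketch that compares the original five heights with a hypothetical sample in which each height is recorded twice. The doubled sample is an invented illustration, not data from the example, and `sum_of_squares` is a helper defined here.

```python
def sum_of_squares(values):
    """Sum of squared differences from the mean of `values`."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

heights = [172, 157, 174, 174, 165]
doubled = heights * 2  # hypothetical sample: every height observed twice

ss_small = sum_of_squares(heights)
ss_large = sum_of_squares(doubled)

# The spread of the values is the same, yet the raw sum doubles.
print(round(ss_small, 1), round(ss_large, 1))

# Dividing by the sample size puts the two samples on the same footing.
print(round(ss_small / len(heights), 2), round(ss_large / len(doubled), 2))
```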
In fact we don't quite do this. If you have 1 observation then you really have 0 information about the variability. If you have 2 observations then you can add up two squared differences, but they are identical, so you only have 1 piece of information. In general, because the differences always add to 0 (as we saw earlier), knowing n-1 of them tells us the last one. So we are really averaging n-1 free differences (hence the name, degrees of freedom), and so we divide by n-1. Here 217.2/4 is 54.3.
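The degrees-of-freedom argument can be sketched in a few lines: the last difference is recovered from the other four, and the sum of squares is then divided by n-1. The variable names are illustrative.

```python
heights = [172, 157, 174, 174, 165]
n = len(heights)
sample_mean = sum(heights) / n
deviations = [x - sample_mean for x in heights]

# The differences add to 0, so the last one is fixed by the first n - 1.
recovered_last = -sum(deviations[:-1])
print(round(recovered_last, 1), round(deviations[-1], 1))

# Average the squared differences over the n - 1 free differences.
sample_variance = sum(d ** 2 for d in deviations) / (n - 1)
print(round(sample_variance, 1))
```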
Because we squared all the differences the units are now cm². To get back to cm, the last thing we do is to take the square root. Thus the sample standard deviation of the height data is the square root of 54.3, which is 7.37 cm.
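As a check on the whole calculation, the sketch below computes the value by hand and compares it with `statistics.stdev` from the Python standard library, which also divides by n-1.

```python
import math
import statistics

heights = [172, 157, 174, 174, 165]
n = len(heights)
sample_mean = sum(heights) / n

# Step by step: squared differences, divide by n - 1, then square root.
manual_sd = math.sqrt(sum((x - sample_mean) ** 2 for x in heights) / (n - 1))

# Library version of the sample standard deviation.
library_sd = statistics.stdev(heights)

print(round(manual_sd, 2), round(library_sd, 2))
```

Note that `statistics.pstdev` divides by n instead, so it would give a slightly smaller value for the same data.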
Finally, note that the sample standard deviation is a random variable, just like the sample mean. If we took another 5 female UQ students then we would likely get a value different to 7.37 cm.
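The sampling variability can be illustrated with a small simulation. This is only a sketch under loud assumptions: the population is taken to be normal with mean 168 cm and standard deviation 7 cm, values invented here for illustration and not taken from any real height data.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

def simulated_sample_sd(n=5, mu=168.0, sigma=7.0):
    """Draw n heights from the assumed normal population; return their sample SD."""
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    return statistics.stdev(sample)

# Three separate samples of 5 give three different standard deviations.
sds = [simulated_sample_sd() for _ in range(3)]
print([round(s, 2) for s in sds])
```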
In all, this equation has a lot of ideas packed into it! One of the highlights of the gallery.
|