Standard deviation recipe

Below is a draft page from a study pack that I’m writing for B/TEC applied science students. I don’t recommend using the other SD formula!

Suppose you measure the heights of 100 children, say girls aged 7. You would expect to find a few very short children, a few who were very tall for their age, and most in the middle range. If you were to tally the number of children in each of a series of height intervals, and then plot a bar chart or ‘histogram’ with height interval along the bottom and frequency up the vertical axis, you would see a pattern close to the familiar bell-shaped curve often referred to as the Normal Distribution.

The mean (and median and mode) of the heights will act as a good ‘typical’ value for the group as the distribution of heights is roughly symmetrical.

You might want to have a measure of how spread out the data is, or how wide the bar chart is. One measure of spread or variation that is simple to calculate is the ‘range’. In statistics, the word range means the difference between the largest and smallest data items – in our example you take the height of the tallest child and subtract the height of the shortest child. This measure of spread lacks robustness as the value depends entirely on two girls from the group – and those two children are by definition at the extremes of the heights.
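As a rough illustration of why the range is fragile, here is a small Python sketch. The heights are made-up values in centimetres, not real measurements, and the function name `data_range` is just a label chosen for this example.

```python
# Hypothetical heights in cm for 10 children (made-up illustrative data).
heights = [118.0, 121.5, 119.0, 125.0, 122.5, 117.5, 120.0, 123.0, 126.5, 121.0]


def data_range(values):
    """Return the statistical range: largest value minus smallest value."""
    return max(values) - min(values)


# Range of the original group: tallest (126.5) minus shortest (117.5).
original_range = data_range(heights)

# Swap one mid-range child (121.0) for one unusually tall child (140.0).
with_extreme = heights[:-1] + [140.0]
changed_range = data_range(with_extreme)

print(original_range)  # 9.0
print(changed_range)   # 22.5
```

Changing a single value more than doubles the range here, even though the other nine heights are untouched.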

A better measure of spread called the standard deviation can be calculated easily and is robust in the sense that the value of the standard deviation depends on all the values in the data set.

To calculate the standard deviation for the raw data set (say 10 values) you just follow the steps below…

Step 1: Find the arithmetic mean of the data by finding the total of all the heights and then dividing by the number of heights. Round the mean off to a sensible number of decimal places.

Step 2: For each data value (heights in this example) subtract the mean from the data value. Ignore the sign of the difference.

Step 3: Square each of the results obtained in step 2 (this is why we could ignore the sign of the difference: a square is never negative).

Step 4: Find the total of the ‘squares of the deviations from the mean’ you found in step 3.

Step 5: Divide the total by one less than the number of results (in the example, we divide by 9, as 10 – 1 = 9).

Step 6: Take the square root of the value calculated in step 5.
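The six steps above can be sketched in Python as follows. This is only an illustration: the heights are made-up values in centimetres, and `sample_sd` is a name chosen for this example. The sketch keeps the mean unrounded and keeps the sign of each deviation, since squaring removes the sign anyway.

```python
import math
import statistics

# Hypothetical heights in cm for 10 children (made-up illustrative data).
heights = [118.0, 121.5, 119.0, 125.0, 122.5, 117.5, 120.0, 123.0, 126.5, 121.0]


def sample_sd(values):
    """Sample standard deviation, following the six steps one at a time."""
    n = len(values)
    mean = sum(values) / n                    # Step 1: arithmetic mean
    deviations = [x - mean for x in values]   # Step 2: value minus mean
    squares = [d ** 2 for d in deviations]    # Step 3: square each deviation
    total = sum(squares)                      # Step 4: total of the squares
    variance = total / (n - 1)                # Step 5: divide by n - 1
    return math.sqrt(variance)                # Step 6: square root


sd = sample_sd(heights)
print(round(sd, 2))

# Cross-check against the standard library, which also divides by n - 1.
print(round(statistics.stdev(heights), 2))
```

Both printed values should agree, because `statistics.stdev` in the Python standard library also calculates the sample standard deviation.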

This result is properly called the sample standard deviation of the data. So what does this number mean? How can we derive benefit from knowing it?

One answer to the question above depends on some properties of the Normal Distribution. Suppose a set of results is known to be distributed according to the Normal Distribution and that the data set has a mean of (say) 108 and a standard deviation of (say) 13. If we work out the mean less one standard deviation (108 – 13) and the mean plus one standard deviation (108 + 13), we will end up with an interval from 95 to 121. If the measurements are normally distributed, we can say that about 68% of the results should fall between these limits.

For two standard deviations above and two standard deviations below the mean (82 to 134 using the example above) you would expect about 95.4% of the values to fall within the interval, and for three standard deviations either side of the mean (69 to 147 in the example) you would expect about 99.7% of the results to fall within the range.
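The intervals and percentages for the example (mean 108, standard deviation 13) can be checked with a short Python sketch. It assumes Python 3.8 or later, where `statistics.NormalDist` is available. The helper name `interval_and_share` is made up for this illustration.

```python
from statistics import NormalDist

mean, sd = 108, 13  # the example values used in the text
dist = NormalDist(mu=mean, sigma=sd)


def interval_and_share(k):
    """Return (low, high, share) for k standard deviations either side of the mean.

    share is the fraction of a Normal Distribution lying between low and high.
    """
    low = mean - k * sd
    high = mean + k * sd
    share = dist.cdf(high) - dist.cdf(low)
    return low, high, share


for k in (1, 2, 3):
    low, high, share = interval_and_share(k)
    print(f"{k} SD: {low} to {high}, about {share * 100:.1f}% of values")
```

The percentages depend only on the number of standard deviations, not on the particular mean and standard deviation, so the same figures apply to any normally distributed measurement.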

If you encounter a value that is more than 3 standard deviations above or below the mean, and if you accept that the variable is likely to be normally distributed, then your very large or small value can be labelled as an ‘outlier’. You might need to check the result or see if there is any reason for the anomalous value.
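A minimal sketch of the three standard deviation check, again using the example mean of 108 and standard deviation of 13. The function name `is_outlier` and the test values are assumptions made for this illustration.

```python
def is_outlier(value, mean, sd, k=3):
    """Return True if value lies more than k standard deviations from the mean."""
    return abs(value - mean) > k * sd


mean, sd = 108, 13  # the example values used in the text

# With these figures, anything below 69 or above 147 is flagged.
for value in (60, 100, 147, 150):
    label = "possible outlier" if is_outlier(value, mean, sd) else "within 3 SD"
    print(value, label)
```

A flagged value is a prompt to check the measurement, not proof that the value is wrong.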
