Ubuntu 5.10

December 31st, 2005

Picture message

December 27th, 2005

Leaf from phone message

Dusk

December 27th, 2005

Junk yard overlooking city

Halford's corner - some movement with flash switched off

Christmas eve

December 24th, 2005

A little Christmas cheer - off air for a few days now - too busy eating and drinking in the mid-winter fashion

Standard deviation recipe

December 24th, 2005

Below is a draft of a page of a study pack for B/TEC applied science students that I’m writing. I don’t recommend using the other SD formula!

Suppose you measure the heights of 100 children, say girls aged 7. You would expect to find a few very short children, a few who were very tall for their age, and most would fall in the middle range. If you were to tally the number of children in a series of height intervals, and then plot a bar chart or ‘histogram’ of the heights with height interval along the bottom and frequency along the vertical axis, you would see a pattern close to the familiar bell-shaped curve often referred to as the Normal Distribution.

The mean (and median and mode) of the heights will act as a good ‘typical’ value for the group as the distribution of heights is roughly symmetrical.

You might want to have a measure of how spread out the data is, or how wide the bar chart is. One simple-to-calculate measure of spread or variation is the ‘range’. In statistics, the word range means the difference between the largest and smallest data items – in our example, you take the height of the tallest child and subtract the height of the shortest. This measure of spread lacks robustness, as its value depends entirely on two girls from the group – the children who are by definition at the extremes of the range of heights.

A better measure of spread called the standard deviation can be calculated easily and is robust in the sense that the value of the standard deviation depends on all the values in the data set.

To calculate the standard deviation for the raw data set (say 10 values) you just follow the steps below (a short code sketch appears after the steps)…

Step 1: Find the arithmetic mean of the data by finding the total of all the heights and then dividing by the number of heights. Round the mean off to a sensible number of decimal places.

Step 2: For each data value (heights in this example) subtract the mean from the data value. Ignore the sign of the difference.

Step 3: Square each of the results obtained in step 2 (this is why we could ignore the sign of the difference).

Step 4: Find the total of the ‘squares of the deviations from the mean’ you found in the last step.

Step 5: Divide the total by one less than the number of results (in the example, we divide by 9, as 10 – 1 = 9).

Step 6: Take the square root of the value calculated in step 5.
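If you prefer to see the recipe as code, here is a minimal Python sketch of the six steps. The heights are made-up values, used purely for illustration.

    import math

    # Hypothetical heights (cm) for ten 7-year-olds - illustration only
    heights = [118.2, 121.5, 119.8, 123.0, 117.4,
               120.9, 122.3, 118.8, 121.1, 119.6]

    # Step 1: the arithmetic mean
    mean = sum(heights) / len(heights)

    # Steps 2-4: total of the squared deviations from the mean
    total = sum((h - mean) ** 2 for h in heights)

    # Step 5: divide by one less than the number of results (10 - 1 = 9)
    # Step 6: take the square root to get the sample standard deviation
    sd = math.sqrt(total / (len(heights) - 1))
    print(f"mean = {mean:.2f} cm, sample SD = {sd:.2f} cm")

Python’s built-in statistics.stdev uses the same n – 1 divisor, so statistics.stdev(heights) should agree with the value of sd.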

This result is properly called the sample standard deviation of the data. So what does this number mean? How can we derive benefit from knowing it?

One answer to the question above depends on some properties of the Normal Distribution. Suppose a set of results is known to be distributed according to the Normal Distribution and that the data set has a mean of (say) 108 and a standard deviation of (say) 13. If we work out the mean less one standard deviation (108 – 13) and the mean plus one standard deviation (108 + 13), we end up with an interval from 95 to 121. If the measurements are normally distributed, we can say that about 68% of the results should fall between these limits.

For two standard deviations above and two standard deviations below (82 to 134 using above example) you would expect 95.5% of the values to fall within the interval, and for three standard deviations either side of the mean (69 to 147 in the example) you would expect 99.7% of the results to fall within the range.

If you encounter a value that is more than 3 standard deviations above or below the mean, and if you accept that the variable is likely to be normally distributed, then your very large or small value can be labelled as an ‘outlier’. You might need to check the result or see if there is any reason for the anomalous value.
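Using the made-up mean of 108 and standard deviation of 13 from above, the outlier check reduces to a single comparison in code:

    def flag_outliers(values, mean, sd, k=3):
        # Values lying more than k standard deviations from the mean
        return [v for v in values if abs(v - mean) > k * sd]

    # 150 is 42 away from 108, more than 3 x 13 = 39, so it is flagged
    print(flag_outliers([102, 96, 150, 111], mean=108, sd=13))  # [150]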

Chi-square statistic and test

December 23rd, 2005

The draft below is part of a study pack I am producing for level 3 students following a B/TEC Maths and Stats unit.

A theory predicts a certain outcome for an experiment – the predictions are usually found by multiplying a list of the probabilities of the various outcomes by the sample size. We shall call these predictions the ‘expected values’. You then run the experiment, and you are not surprised to find that the observed values – one result for each expected value – differ slightly from the corresponding expected values. But the crucial questions are ‘how different?’ and then perhaps ‘are the observed values sufficiently different from the expected values to make me disbelieve the theory I used to calculate the expected results?’

In order to answer the first question, we need a way of calculating a number that tells us ‘how far’ or ‘how different’ the expected values are from the observed values. Such a number or ‘statistic’ has been invented, discovered or devised: the chi-squared statistic. The recipe for calculating the chi-squared statistic follows (with a short code sketch after the steps)…

Step 1: calculate a list of your expected values based on the theory that you are trying to test. Your list may contain as few as two values or may contain many entries, possibly organised as rows and columns – it all depends on your theory.

Step 2: perform your experiment and list your observed values in a way that facilitates comparison with the expected values (a table springs to mind). Bear in mind that your observed values (and expected values) must be actual counts, not percentages, proportions or fractions. If you are calculating expected counts or frequencies, don’t round them off to the nearest whole one – leave them to a sensible number of decimal places.

Step 3: For each expected value and the corresponding observed value, find the difference. Ignore the sign of the difference.

Step 4: Square each of the differences found in the last step.

Step 5: Divide each squared difference by the corresponding expected value.

Step 6: Find the total of all the values found in step 5.
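The six steps reduce to a few lines of Python. The counts below are made up, but chosen to match the fruit-fly example discussed later: a 3:1 wing-form theory applied to 60 flies gives expected values of 45 and 15.

    def chi_squared(observed, expected):
        # Steps 3-6: square each difference, divide by the
        # corresponding expected value, then total the results
        return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    observed = [46, 14]       # made-up counts: normal vs vestigial wings
    expected = [45.0, 15.0]   # a 3:1 theory applied to 60 flies
    print(chi_squared(observed, expected))  # about 0.089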

The number you are left with will be positive and may be large (e.g. 10 or 20) or small (e.g. 0.07). This total score is called the chi-square statistic.

Our second question can be summarised as ‘is the chi-squared statistic so large that I can’t believe that the theory describes the experimental situation accurately?’. Answering it involves adopting a probability level: we agree to conclude that the expected values are not consistent with the observed values if the probability that they really agree, and that random variation explains the difference, is less than 5% (or 1%, or 0.5%, depending on the level of ‘false positives’ – rejections of a theory that is in fact adequate – we are content to accept). We then use a set of chi-square tables to look up a critical value of the chi-squared statistic for the chosen probability level. If the calculated chi-squared statistic is higher than the critical value, then we cannot assume that the theory adequately describes the results of the experiment. If it is less than the critical value, then we can assume that the theory is consistent with the observed values. When the two values are close, we have a judgement call.

There is a catch – in order to find the critical value for the chi-squared statistic, we have to decide on a probability level and we have to know the ‘degrees of freedom’ available in the data. We then enter the chi-square table at the column corresponding to our chosen probability level and at the row corresponding to the appropriate number of degrees of freedom.
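As a sketch, the 5% column of a chi-square table for small numbers of degrees of freedom can be captured as a simple lookup (these are the standard 5% critical values):

    # 5% critical values of the chi-squared statistic
    CRITICAL_5_PERCENT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

    def theory_rejected(chi_sq, dof):
        # True if the statistic exceeds the 5% critical value
        return chi_sq > CRITICAL_5_PERCENT[dof]

    print(theory_rejected(0.089, 1))  # False - the 3:1 theory survives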

The procedure used to calculate the number of degrees of freedom appropriate to a set of data depends on the number of rows and columns in the data. In the case of a simple one-column list, we just take one less than the number of items in the list as the degrees of freedom for that data.

In the case of a number of rows and columns, you take one less than the number of rows, then one less than the number of columns, and multiply the two numbers. In this way a block of observed values with 3 columns of 10 numbers will have (10 – 1) × (3 – 1) = 18 degrees of freedom.

A special case will be of primary interest to us: that of a table of results that contains just two numbers. In this case, there is one degree of freedom.
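A tiny helper covers both cases (the names and layout are just one way of writing it):

    def degrees_of_freedom(rows, cols=1):
        # One column: rows - 1. Rows and columns: (rows - 1) * (cols - 1)
        if cols == 1:
            return rows - 1
        return (rows - 1) * (cols - 1)

    print(degrees_of_freedom(10, 3))  # 18, as in the example above
    print(degrees_of_freedom(2))      # 1, for a table of just two numbers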

The concept of degrees of freedom is easy to relate to the experiment involving fruit flies that we will turn to shortly. If you have exactly 60 flies (and counting flies that can move and still become airborne is a skill in itself), and you know that 14 of those flies have vestigial wing forms, then you also know that 46 of the flies have fully formed wings. Once you have specified one number, the other is known by subtraction from the total. There is effectively only one random number in the experiment, hence one degree of freedom.

There is one more correction that is applied in the special case of a set of results with just two numbers – the case that is of primary interest to us. Yates argued that with only one column and two numbers, the differences between the expected and observed results will always be the same size but with opposite signs. He went on to argue that this ‘heads and tails’ property of the differences would lead to a ‘lumpy’ chi-square value, and he proposed a ‘continuity correction’ – subtracting 0.5 from the size of each difference before squaring – that would smooth out the ‘lumpiness’.
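Assuming the usual form of Yates’ correction, the earlier sketch changes in a single line – the size of each difference is reduced by 0.5 before squaring:

    def chi_squared_yates(observed, expected):
        # Yates' continuity correction for the one degree of freedom case
        return sum((abs(o - e) - 0.5) ** 2 / e
                   for o, e in zip(observed, expected))

    print(chi_squared_yates([46, 14], [45.0, 15.0]))  # about 0.022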

There is also a limitation: the expected value in each cell of your table must be greater than about 5, as we are assuming a normal distribution of differences about the expected value. For expected frequencies below about 5 in a given cell, the chance of a negative deviation differs from the chance of a positive deviation of similar size, and the distribution of deviations can no longer be assumed to be symmetrical.

Cool or crap?

December 22nd, 2005

Jeffrey Zeldman without his woolly hat

“The bad news is that college and university design curricula are still mostly about everything but information architecture, usability, application design, user-focused design, accessibility, and web standards.”
Jeffrey Zeldman

The Zeldman article was very influential at the time and makes good reading now. Alas, the new Adobe page is forcing me to download a plug-in to read the article! And to load the player, I have to quit Safari!!

To read the article you need to have Flash Player 8 installed. Worth reading.