A New View of Statistics
Summarizing Data: SIMPLE STATISTICS & EFFECT STATISTICS continued
The median, or "middle" number, can be useful for data with a non-normal distribution. To work it out, arrange the numbers in rank order (smallest to largest), then count in from one end until you find the middle. (If the sample size is an even number, take the average of the two middle numbers.) The median is not affected by outliers, which is a big point in its favor. But if you're interested in getting an estimate of the center of a population or of a subgroup of a population--and you usually are--the median is a coarse or "noisy" measure.
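The counting procedure described above can be sketched in a few lines of Python. This is only an illustration: the function name `median` and the weights are made up for the example, and the second list simply swaps the largest value for a data-entry-style outlier to show that the median does not move while the mean does.

```python
from statistics import mean

def median(values):
    """Return the median of a non-empty sequence of numbers."""
    if len(values) == 0:
        raise ValueError("median of an empty sample is undefined")
    ordered = sorted(values)              # rank order, smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]               # odd sample size: the single middle value
    # even sample size: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

# Made-up weights in kg, purely for illustration.
weights = [64.1, 66.0, 67.5, 68.2, 70.3]
# Same data, but the largest value is replaced by an outlier.
with_outlier = weights[:-1] + [170.3]

print("median:", median(weights), "->", median(with_outlier))
print("mean:  ", mean(weights), "->", mean(with_outlier))
```

The standard library also provides `statistics.median`, which follows the same rule for even sample sizes.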
The mode, or most frequent number, is the only other measure of centrality you'll ever encounter. I've never used it.
The range is a bad measure of spread, for two reasons. First, it's dictated by outliers, whether they're errors in data entry or genuine values. Second, the range depends on the size of your sample: the more numbers, the bigger the range is likely to be. Two measures of spread that avoid these problems are the standard deviation (SD) and percentile ranges. I'll deal with these separately, along with two other measures of variation: the root mean square error (RMSE) and the standard error of the estimate (SEE). I explain on a separate page why the standard error of the mean is a measure of spread you should not use.
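To see how the range depends on sample size, here is a small simulation sketch. Everything in it is an assumption for illustration: the values are drawn from a normal distribution with a mean of 67.8 and an SD of 3.6, the seed is arbitrary, and the helper `sample_range` is hypothetical. Because each larger sample contains the smaller one, the range here can only stay the same or grow as numbers are added.

```python
import random

def sample_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

rng = random.Random(1)  # fixed seed (arbitrary) so the run is repeatable
# Simulated weights in kg; the mean and SD are assumed values.
draws = [rng.gauss(67.8, 3.6) for _ in range(1000)]

# Nested samples: the first 10, the first 100, and all 1000 values.
ranges = {n: sample_range(draws[:n]) for n in (10, 100, 1000)}

for n, r in ranges.items():
    print(f"n = {n:4d}: range = {r:.1f} kg")
```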
The statistics most people use to describe a set of numbers are sample size, mean, and standard deviation. All you need to define the shape of the normal distribution is the mean and the standard deviation. The mean and standard deviation are often written as mean ± SD: 67.8 ± 3.6 kg, for example.
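As a rough sketch, the three summary statistics can be produced with Python's standard `statistics` module. The helper `describe` and the weights are hypothetical. I have also assumed the usual sample standard deviation (dividing by n - 1), which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

def describe(values, unit):
    """Summarize a sample as 'n = ..., mean ± SD unit'."""
    n = len(values)
    m = mean(values)
    sd = stdev(values)  # sample SD (n - 1 in the denominator); needs n >= 2
    return f"n = {n}, {m:.1f} ± {sd:.1f} {unit}"

# Made-up weights in kg, purely for illustration.
weights = [64.1, 66.0, 67.5, 68.2, 70.3]
print(describe(weights, "kg"))
```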
In dealing with the spread in a bunch of numbers, we often think about the numbers as representing values of some characteristic, such as weight, for different subjects. But the bunch of numbers could represent the weight of a single subject measured many times. We talk about between-subject variation and within-subject variation to distinguish between these two types of spread. Within-subject variation comes up soon as a useful measure of reliability.
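The distinction can be illustrated with a small sketch. The data are invented (three subjects, each weighed three times), and the calculation is just one simple way to look at the two kinds of spread: the SD of each subject's repeated measurements for within-subject variation, and the SD of the subjects' mean weights for between-subject variation. It is not meant to be the reliability measure itself.

```python
from statistics import mean, stdev

# Made-up repeated weights in kg for three hypothetical subjects.
measurements = {
    "A": [60.2, 60.8, 59.9],
    "B": [67.5, 68.1, 67.9],
    "C": [75.0, 74.3, 74.8],
}

# Within-subject variation: spread of one subject's repeated measurements.
within_sd = {subject: stdev(values) for subject, values in measurements.items()}

# Between-subject variation: spread of the subjects' mean weights.
subject_means = [mean(values) for values in measurements.values()]
between_sd = stdev(subject_means)

for subject, sd in within_sd.items():
    print(f"within-subject SD for {subject}: {sd:.2f} kg")
print(f"between-subject SD: {between_sd:.2f} kg")
```

With data like these, the subjects differ from each other far more than any one subject differs from measurement to measurement.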