What I said on the last few pages about t tests of ordinal variables and t tests of Likert-scale variables applies also to counts: t tests are usually OK, and they will fall over only when you have a small sample size and more than 70% of your subjects score zero counts (because then the sampling distribution of the difference between the means won't be close enough to normal).
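To get a feel for what that means in practice, here's a minimal simulation sketch in Python (the sample sizes, the means and the Poisson assumption are purely illustrative choices, not real data):

    # Sketch: ordinary t test on counts, using scipy and numpy.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # Moderate samples with means well away from zero: the t test behaves fine.
    group_a = rng.poisson(lam=4.0, size=30)   # e.g., injuries per team, control
    group_b = rng.poisson(lam=6.0, size=30)   # e.g., injuries per team, new program
    t, p = stats.ttest_ind(group_a, group_b)
    print(f"t = {t:.2f}, p = {p:.3f}")

    # Trouble case: tiny samples in which most subjects score zero counts.
    sparse_a = rng.poisson(lam=0.2, size=8)   # mostly zeros
    sparse_b = rng.poisson(lam=0.4, size=8)
    print(stats.ttest_ind(sparse_a, sparse_b))  # not trustworthy here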
When you are fitting lines or curves, you also have to worry about non-uniformity of residuals. With counts, this worry is very real, because the variation in a given count from sample to sample depends on how big the count is. For example, the typical variation (standard deviation) in a count is usually simply the square root of the count, so a count of about 400 injuries varies typically by ±20, whereas a count of about 40 injuries varies typically by ±6. I hope it's obvious that the residuals for injury counts of 400 will therefore be much larger than those for counts of about 40. Rank transformation would fix these non-uniform residuals, but better approaches are available: binomial regression, Poisson regression, square-root transformation and arcsine-root transformation.
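If you want to check that square-root relationship for yourself, here's a small simulation sketch in Python (assuming the counts are Poisson-distributed; the means are just the ones from the example above):

    # Sketch: for Poisson-like counts, the typical sample-to-sample variation
    # (SD) of a count is roughly the square root of its mean.
    import numpy as np

    rng = np.random.default_rng(2)
    for mean_count in (40, 400):
        counts = rng.poisson(lam=mean_count, size=10_000)
        print(f"mean ~{mean_count}: SD = {counts.std():.1f}, "
              f"sqrt(mean) = {np.sqrt(mean_count):.1f}")
    # Expect an SD close to 6 for counts around 40 and close to 20 for counts around 400.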
Binomial and Poisson Regression
When counts have a smallish upper bound (e.g., the number of injured players in a squad of 24 is at most 24), the counts from sample to sample vary according to what is known as a binomial distribution. When the upper bound is very large compared with the observed values of the count (e.g., the number of spinal injuries in American football each year), the counts have a Poisson distribution. With a good stats program, you can dial up an analysis that uses either of these distributions. The result is a binomial regression or a Poisson regression. In the Statistical Analysis System, you can do these analyses with Proc Genmod. Genmod stands for generalized linear modeling, which is an advanced form of general linear modeling that allows for the properties of non-normally distributed variables such as counts and proportions based on counts.
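If you don't use SAS, any package with generalized linear modeling will do the same job. Here's a minimal sketch in Python using the statsmodels package (the data, predictors and variable names are invented for illustration):

    # Sketch of Poisson and binomial regression with statsmodels,
    # analogous to what Proc Genmod does in SAS.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)

    # Poisson regression: yearly spinal-injury counts vs. a predictor (year).
    year = np.arange(20)
    injuries = rng.poisson(lam=np.exp(2.0 + 0.03 * year))
    poisson_fit = sm.GLM(injuries, sm.add_constant(year),
                         family=sm.families.Poisson()).fit()
    print(poisson_fit.summary())

    # Binomial regression: injured players out of a squad of 24 vs. training load.
    load = rng.uniform(0, 10, size=30)
    injured = rng.binomial(n=24, p=1 / (1 + np.exp(-(-2.0 + 0.2 * load))))
    endog = np.column_stack([injured, 24 - injured])   # successes, failures
    binom_fit = sm.GLM(endog, sm.add_constant(load),
                       family=sm.families.Binomial()).fit()
    print(binom_fit.params)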
Don't feel intimidated by binomial and Poisson. Are you happy with the notion that the values of most variables have the bell-shaped normal distribution? OK, counts or proportions of something don't have the normal shape when the counts are small, so we need different mathematics to describe their shapes, and different names for them. As counts get larger, the shapes of the binomial and Poisson distributions tend towards the normal shape. You still have the problem of non-uniform residuals, though, because the variability from observation to observation for larger counts is more (in absolute values) or less (in percentage terms) than for smaller counts. Binomial and Poisson regressions and other forms of generalized linear modeling take care of the non-uniformity. For more on generalized linear modeling, in particular the specification and use of distributions and link functions, read this message I sent to the Sportscience email list in July 2004.
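To convince yourself that the Poisson shape really does approach the normal shape as counts get larger, here's a tiny sketch (again just an illustration) comparing Poisson probabilities with the corresponding normal density:

    # Sketch: compare the Poisson probability of each count with the normal
    # density that has the same mean (lam) and SD (sqrt(lam)).
    import numpy as np
    from scipy.stats import poisson, norm

    for lam in (2, 50):
        k = np.arange(max(0, int(lam - 3 * np.sqrt(lam))),
                      int(lam + 3 * np.sqrt(lam)) + 1)
        max_diff = np.max(np.abs(poisson.pmf(k, lam) -
                                 norm.pdf(k, loc=lam, scale=np.sqrt(lam))))
        print(f"lam = {lam}: largest Poisson-vs-normal discrepancy = {max_diff:.4f}")
    # The discrepancy shrinks as the mean count (lam) grows.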
Square-root and Arcsine-root Transformation
One way to deal with non-uniform residuals is to transform the variable. We've seen that log transformation works for some variables, and rank transformation works for most variables as a last resort. Is there a transformation for counts that will allow us to use normal analyses instead of binomial or Poisson regression? Yes: provided you aren't close to some upper bound in the counts, just use the square root of the counts in the usual analyses. When you've derived the outcome statistic and its confidence limits, assess their magnitudes with Cohen's or my scale of effect sizes, as I explained for rank transformation. You can't back-transform an effect (such as a difference between means) into a count by squaring it, but you can get a feel for the magnitude as a count relative to the mean by adding the value of the effect appropriately to the mean of the square-rooted counts, then squaring it. Square the mean for comparison. Add each of the confidence limits of the effect to the square-rooted mean and square it to get a feel for the precision of the magnitude.
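Here's what that recipe looks like as a sketch in Python (invented injury counts; the 1.96 used for the confidence limits assumes an approximately normal sampling distribution):

    # Sketch: analyze square-rooted counts, then get a feel for the effect
    # and its confidence limits back on the count scale.
    import numpy as np

    rng = np.random.default_rng(4)
    n = 40
    control = np.sqrt(rng.poisson(lam=9, size=n))   # square-rooted counts
    treated = np.sqrt(rng.poisson(lam=6, size=n))

    effect = treated.mean() - control.mean()        # difference in root-count units
    se = np.sqrt(treated.var(ddof=1) / n + control.var(ddof=1) / n)
    lower, upper = effect - 1.96 * se, effect + 1.96 * se

    root_mean = control.mean()                      # mean of the square-rooted counts
    print("reference count:", root_mean ** 2)
    print("count implied by the effect:", (root_mean + effect) ** 2)
    print("counts implied by the confidence limits:",
          (root_mean + lower) ** 2, (root_mean + upper) ** 2)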
Read the cautionary note about how the value of a back-transformed mean is not the same as the mean of the raw variable. For a simple example, imagine you have a team with only one injury this season and another team with nine injuries. The mean of the raw number of injuries is (1+9)/2 = 5. But the mean of the root-transformed injury counts is (1+3)/2 = 2, and when you square 2 to back-transform it you get 4!
Proportions require an exotic transformation called arcsine-root. To use this transformation, express the proportion as a number between 0 and 1 (e.g., 210 Type I muscle fibers in a biopsy of 542 total fibers represents a proportion of 210/542 = 0.387). Now take the square root and find the inverse sine (arcsine) of the resulting number; in other words, find the angle whose sine is the square root of the proportion. (The angle can be in degrees or radians, where 360 degrees is 2 pi radians.) Use that weird variable in your analysis, but weight each observation by the number in the denominator of the proportion, to ensure that the residuals in the analysis are uniform. You'll have to read the documentation for your stats program to see how to apply a weighting factor. To gauge the magnitude of effects with an arcsine-root transformed variable, apply the Cohen or Hopkins scale, as explained for rank transformation. The appropriate standard deviation is the root-mean-square error from the analysis of the transformed variable, because this error should take into account the weighting factors. As is the case for counts, back-transformation of the observed effect works only if you add the effect appropriately to the mean before taking its sine and squaring it. Multiply the result by 100 if you want it as a percent. Do the same with the confidence limits.
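Here's the whole arcsine-root recipe as a sketch in Python, with a weighted least-squares fit standing in for whatever weighted analysis your stats program offers (the fiber counts and the group coding are invented for illustration; the angles are in radians):

    # Sketch: arcsine-root transformation of proportions, weighting each
    # observation by the denominator of its proportion.
    import numpy as np
    import statsmodels.api as sm

    fibers_type1 = np.array([210, 180, 260, 150, 300, 220])
    fibers_total = np.array([542, 500, 610, 480, 650, 560])
    group = np.array([0, 0, 0, 1, 1, 1])            # 0 = control, 1 = trained

    prop = fibers_type1 / fibers_total              # proportion between 0 and 1
    y = np.arcsin(np.sqrt(prop))                    # arcsine-root transform (radians)

    fit = sm.WLS(y, sm.add_constant(group), weights=fibers_total).fit()
    effect = fit.params[1]                          # group difference, transformed units
    lo, hi = fit.conf_int()[1]                      # its confidence limits

    # Back-transform: add the effect to the mean, take the sine, square, x100 for percent.
    mean_t = np.average(y, weights=fibers_total)
    print("reference %:", 100 * np.sin(mean_t) ** 2)
    print("% implied by the effect:", 100 * np.sin(mean_t + effect) ** 2)
    print("% implied by the confidence limits:",
          100 * np.sin(mean_t + lo) ** 2, 100 * np.sin(mean_t + hi) ** 2)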
The square root and arcsine-root transformations work well even for low counts or zero proportions. As with ordinal variables, you'll get into trouble only with small sample sizes when more than 70% of your subjects have a score of zero or a proportion of zero. Then you have to use binomial or Poisson regression.
Phew! The square-root and arcsine-root approaches are complex. I recommend that you come to terms with a stats package that offers binomial and Poisson regression or generalized linear modeling.