A New View of Statistics

© 2002 Will G Hopkins


Generalizing to a Population: REPEATED-MEASURES MODELS continued

Do a fitness test on a bunch of subjects.  Rank the subjects by their score and select the bottom half of the bunch.  Retest the bottom half.  The average score of the bottom half will probably improve somewhat on retest. Similarly, the average score of the top half will probably drop somewhat on retest.  These changes in performance are called regression to the mean. The name refers to a tendency for subjects who score below average on a test to do better next time, and for those who score above average to do worse.

The group you select doesn't have to be the bottom or top half, and the test doesn't have to be the first one.  Any group or even any subject you choose with an average score below or above the mean of all the subjects in a given test will probably move (regress) noticeably closer to the mean in another test.  In general the scores don't move completely to the mean–they just get closer to it.  It is therefore more accurate to call the phenomenon regression towards the mean.

OK, so low scorers tend to get better on retest, and high scorers tend to get worse?  Well, no, actually.  Depending on the nature of your data, the change in the scores towards the mean may be partly or even entirely a statistical artifact. If it's entirely an artifact, the true scores of the subjects don't really change on retest–it just looks that way.  When that happens in, for example, a training study, your analysis might lead you to conclude that the least fit subjects got a big benefit from the training, whereas the fittest subjects got a smaller benefit or may even have got worse.  In reality, all subjects may have increased in fitness by a similar amount, regardless of initial fitness.  Your conclusion about the effect of initial fitness could be artifactual garbage.

Regression to the mean can lead to similar mistakes with repeated observation or testing of the health or performance of an individual. Consider a patient with a chronic health problem. Depending on the problem, symptoms can fluctuate in severity over a period of weeks or months, for no apparent reason. When the symptoms get really bad, the patient may try a new alternative therapy. The symptoms then improve, because they were bound to improve from their atypical severe level. The patient can be forgiven for thinking that the new therapy worked. Later on, the patient stops taking the new therapy, the symptoms get bad again, the patient takes the therapy again, the symptoms improve... Get the picture? You can imagine a similar scenario with an athlete who turns in a particularly bad performance, then does something about it. Whatever the athlete does, it's likely to work–artifactually. Now you can understand why there is so much snake oil on the shelves of drug stores.
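The patient scenario can be mimicked with a few lines of simulation. The sketch below is only an illustration under assumptions that are not in the text: weekly severity scores fluctuate independently around a constant usual level, the therapy does nothing at all, and the patient starts it whenever severity reaches a threshold. The function name and all numeric values are hypothetical.

```python
import random
import statistics


def change_after_bad_week(n_weeks=100000, usual=5.0, fluctuation_sd=2.0,
                          bad=8.0, seed=3):
    """Mean change in severity in the week after each 'really bad' week.

    Assumption: severity is pure random fluctuation around `usual`, so any
    therapy started in a bad week has no effect whatsoever.
    """
    rng = random.Random(seed)
    severity = [rng.gauss(usual, fluctuation_sd) for _ in range(n_weeks)]
    # The patient starts the (useless) therapy in every week at or above `bad`.
    changes = [severity[w + 1] - severity[w]
               for w in range(n_weeks - 1) if severity[w] >= bad]
    return statistics.fmean(changes), len(changes)


if __name__ == "__main__":
    mean_change, n_bad_weeks = change_after_bad_week()
    print(f"bad weeks: {n_bad_weeks}, mean change next week: {mean_change:+.2f}")
```

The mean change comes out clearly negative (symptoms "improve") even though nothing was done to the simulated patient.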

I'll now deal with the nature of the artifact when you analyze data from a group of individuals. The subsections are: the cause of the artifact, the magnitude of the artifact, and how to avoid the artifact.

 Cause of the Artifact
Regression to the mean occurs because of noise (error) in the test score. Noise refers to the random fluctuations in a subject's score between tests–the typical or standard error of measurement. When you select subjects who scored low in one test, their scores were low partly because the noise just happened to make the scores low in that test. In other words, their true scores aren't really as low as the scores you selected. When you retest these low scorers, their scores in the retest will on average be their true scores (plus or minus the noise of the test, of course), so the scores are likely to rise. For the same reason, high scorers selected by you in one test are likely to fall on retest. Average scorers, on the other hand, are equally likely to rise or fall, so on average they don't change. The overall pattern is therefore for scores different from the mean in one test to regress towards the mean in another test.
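The argument above can be checked with a small simulation. This is a minimal sketch, assuming normally distributed true scores that do not change between tests and independent normal noise in each test; the function names and the numeric values (mean 50, true SD 10, noise SD 5) are made up for illustration.

```python
import random
import statistics


def simulate_test_retest(n_subjects=20000, true_mean=50.0, true_sd=10.0,
                         noise_sd=5.0, seed=1):
    """Return (test1, test2): two noisy measurements of unchanging true scores."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(true_mean, true_sd) for _ in range(n_subjects)]
    # Each test adds fresh, independent noise to the same true score.
    test1 = [t + rng.gauss(0.0, noise_sd) for t in true_scores]
    test2 = [t + rng.gauss(0.0, noise_sd) for t in true_scores]
    return test1, test2


def half_means(test1, test2):
    """Mean test and retest scores of the bottom and top halves ranked on test 1."""
    order = sorted(range(len(test1)), key=lambda i: test1[i])
    half = len(order) // 2
    groups = {"bottom": order[:half], "top": order[half:]}
    return {
        name: (statistics.fmean(test1[i] for i in idx),
               statistics.fmean(test2[i] for i in idx))
        for name, idx in groups.items()
    }


if __name__ == "__main__":
    test1, test2 = simulate_test_retest()
    for name, (m1, m2) in half_means(test1, test2).items():
        print(f"{name} half: test {m1:.1f}, retest {m2:.1f}, change {m2 - m1:+.1f}")
```

Because the true scores are identical in both tests, any rise in the bottom half and fall in the top half in this sketch is produced by the noise alone.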

The noise responsible for regression to the mean can come from two sources:  the measuring instrument (technical or technological noise) and the subjects themselves (within-subject variation from test to test).  I use the word instrument in its most generic sense: it could be a questionnaire, a device for measuring oxygen consumption, or whatever.   If the noise comes solely from the instrument, regression to the mean is unquestionably an artifact.  But if the noise is due to within-subject variation, there is a sense in which the regression to the mean is real. I'll explain with an example. 

Suppose you administer two fitness tests several months apart.  Several months is long enough for many subjects to change their fitness substantially: some will be fitter, some less fit.  The "noise" in the test could be due almost entirely to these random but real within-subject changes in fitness.  So when you select a subgroup with low fitness scores in the first test, the increase in their fitness in the second test is a real increase.  If the increase is real, is there still a problem?  Yes, because you could easily attribute the increase in fitness to something you had done between the tests, such as a training or nutritional intervention.  The increase in fitness is real, but some of it was going to happen anyway, regardless of whatever you did. There are many papers in the literature in which the authors did not take account of regression to the mean when they claimed that their treatment produced a bigger increase in fitness on subjects with lower initial fitness.

 Magnitude of the Artifact
There is a simple formula for estimating the magnitude of regression to the mean: on retest, scores will move towards the mean by a fraction given by 1 – r, where r is the reliability correlation between test and retest scores. So, if r = 0.9, and you select a group of subjects whose average score is, say, 20 units above the mean, you can expect the average scores of those subjects to drop on retest by an average of (1 – 0.9) × 20, or 2 units. Obviously, the smaller the r, the bigger the fractional move towards the mean. In the extreme case of r = 0, scores on retest regress on average all the way back to the mean. The 1 – r formula comes from the page Regression to the Mean at Bill Trochim's stats site. There is no proof or reference for the formula at his site, but it checks out with my simulations.
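One way to check the 1 – r formula by simulation is sketched below. It is not the author's simulation: it assumes normal true scores, independent normal noise, and a group selected for scoring at least one standard deviation above the mean in the first test. All names and numbers are hypothetical.

```python
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def check_one_minus_r(n_subjects=50000, true_sd=10.0, noise_sd=5.0, seed=2):
    """Return (r, predicted fraction 1 - r, observed fractional shift)."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(100.0, true_sd) for _ in range(n_subjects)]
    test1 = [t + rng.gauss(0.0, noise_sd) for t in true_scores]
    test2 = [t + rng.gauss(0.0, noise_sd) for t in true_scores]
    r = pearson_r(test1, test2)

    mean1, mean2 = statistics.fmean(test1), statistics.fmean(test2)
    # Select subjects at least one observed SD above the mean in test 1.
    cutoff = mean1 + statistics.pstdev(test1)
    chosen = [i for i in range(n_subjects) if test1[i] >= cutoff]
    dev1 = statistics.fmean(test1[i] for i in chosen) - mean1
    dev2 = statistics.fmean(test2[i] for i in chosen) - mean2
    # Fraction of the initial deviation from the mean that is lost on retest.
    observed_fraction = (dev1 - dev2) / dev1
    return r, 1.0 - r, observed_fraction


if __name__ == "__main__":
    r, predicted, observed = check_one_minus_r()
    print(f"r = {r:.3f}, 1 - r = {predicted:.3f}, observed shift = {observed:.3f}")
```

With these settings the predicted and observed fractions should agree closely; other selection rules (bottom half, top decile, and so on) can be tried by changing how `chosen` is built.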

The retest correlation is involved in regression to the mean, because the correlation is a measure of the magnitude of the noise in the measurement. The formula for r is (SD² – sd²)/SD², where sd is the within-subject standard deviation (the typical or standard error of measurement, or the noise) and SD is the usual between-subject standard deviation in either test. Rearranging, 1 – r = the fractional shift towards the mean = sd²/SD². If sd is small relative to SD, there is little regression to the mean. At the other extreme, when SD = sd, subjects are effectively identical (the only difference between subjects is noise), so all pre-selected scores that differ from the mean will, on average, regress completely to the mean on retest.
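The relationships between r, sd and SD translate directly into a few helper functions. This is a sketch only: the function names are invented here, and the example values (sd = 3, SD = 10, a score of 120 against a mean of 100) are arbitrary.

```python
def retest_correlation(sd_within, sd_between):
    """r = (SD^2 - sd^2) / SD^2, with SD the observed between-subject SD."""
    return (sd_between ** 2 - sd_within ** 2) / sd_between ** 2


def fraction_regressed(sd_within, sd_between):
    """Fractional shift towards the mean on retest: 1 - r = sd^2 / SD^2."""
    return sd_within ** 2 / sd_between ** 2


def expected_retest_score(score, mean, sd_within, sd_between):
    """Score expected on retest for a subject or group selected at `score`."""
    return score - fraction_regressed(sd_within, sd_between) * (score - mean)


if __name__ == "__main__":
    sd, SD = 3.0, 10.0
    print(f"r = {retest_correlation(sd, SD):.2f}")
    print(f"fraction regressed = {fraction_regressed(sd, SD):.2f}")
    print(f"expected retest score = {expected_retest_score(120.0, 100.0, sd, SD):.1f}")
```

With sd = 3 and SD = 10 this gives r = 0.91, so a score 20 units above the mean is expected to fall by 0.09 × 20 = 1.8 units on retest. Setting sd equal to SD gives r = 0 and complete regression to the mean.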

The above formulae will allow you to estimate how much of a change in the mean is artifactual, but you should also be concerned about precision of the estimate, that is, the confidence limits for the true value.  Bill Trochim does not have a formula for the confidence limits for the adjusted change in the mean. In the next section I will explain how to use the formula and get confidence limits.

 How to Avoid the Artifact
Regression to the mean is a problem only when there is substantial noise in your dependent variable and you subdivide your subjects into groups that differ in their mean scores in one of the tests.  Using the best test available is one way to reduce the effect of noise, but that won't reduce noise represented by real random changes in the subjects over the period between the tests. Of course, you can avoid the problem by not subdividing your subjects on the basis of their initial scores, but it is nice to know how a subject's initial score affects the outcome of a treatment. For example, you should find out if people with high initial scores get little benefit, because it's a waste of time using the treatment on such people.  There are two approaches:  correct the change scores using a formula, or use a control group. I once had an additional approach on this page, based on using the mean of each subject's pre- and post-test scores to subdivide the subjects. This approach eliminates regression to the mean, but it works properly only when the effect of the subject's pre-test score on the effect of the treatment is small (relative to the between-subject standard deviation in the pre-test). In general, you won't know how big the effect of the pre-test score is, so I have had to shelve this approach for the time being.

Correct the Change Scores
To use this approach, you will need to know either the retest correlation coefficient (r) or the within-subject variation (standard deviation, sd) for the dependent variable. Both must come from a reliability study with subjects and time between tests similar to those in your study. In my experience, an appropriate reliability study is often not available in the literature, so you will have to guesstimate the reliability from less applicable reliability studies. Guesstimate an sd rather than an r, because r is sensitive to the between-subject standard deviation of the subjects in the reliability study.

Armed with the reliability sd or r, proceed as follows. Subtract the pre-test mean of all subjects from each subject's pre-test score. Multiply that difference either by sd²/SD² or by (1 – r), where SD is the usual between-subject standard deviation of your subjects in the pre-test. Now add the result (or subtract it when it is negative) to the post-pre change score for that subject. This corrected change score is free of the artifact. Use it in your analyses just as you would any change score. For example, do an unpaired t test to compare subjects with low vs high pre-test scores. Better still, plot the corrected change scores on the Y axis against the pre-test scores on the X axis. If the points form something like a line, derive the slope of the line as an estimate of the effect of pre-test score on the effect of the treatment.
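The correction steps can be sketched as a short function. The code below is one illustrative reading of the procedure, not a validated implementation: the function names are invented, and the demonstration data are simulated under the assumption that every subject gains the same amount (3 units) regardless of initial score, with a known noise sd of 5.

```python
import random
import statistics


def corrected_change_scores(pre, post, sd_within=None, r=None):
    """Post-pre change scores adjusted for regression to the mean.

    Supply exactly one of sd_within (reliability sd) or r (retest correlation).
    """
    if (sd_within is None) == (r is None):
        raise ValueError("supply exactly one of sd_within or r")
    pre_mean = statistics.fmean(pre)
    if r is not None:
        fraction = 1.0 - r
    else:
        # SD is the between-subject standard deviation in the pre-test.
        fraction = sd_within ** 2 / statistics.stdev(pre) ** 2
    # Adding fraction * (pre - mean) lowers the change score of low scorers
    # and raises that of high scorers.
    return [(b - a) + fraction * (a - pre_mean) for a, b in zip(pre, post)]


def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def demo(n_subjects=20000, noise_sd=5.0, gain=3.0, seed=4):
    """Slopes of raw and corrected change scores against pre-test score."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, 10.0) for _ in range(n_subjects)]
    pre = [t + rng.gauss(0.0, noise_sd) for t in true_scores]
    post = [t + gain + rng.gauss(0.0, noise_sd) for t in true_scores]
    raw = [b - a for a, b in zip(pre, post)]
    corrected = corrected_change_scores(pre, post, sd_within=noise_sd)
    return slope(pre, raw), slope(pre, corrected)


if __name__ == "__main__":
    raw_slope, corrected_slope = demo()
    print(f"raw slope {raw_slope:+.3f}, corrected slope {corrected_slope:+.3f}")
```

In this simulated case the raw change scores show a spurious negative slope against pre-test score, while the slope for the corrected change scores is close to zero, as it should be when the treatment effect does not depend on initial score.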

Be aware that the confidence interval (or p value) for any effects involving the adjusted change score will be too small if the reliability study had a small sample size, owing to uncertainty in the estimate of sd or r. The effects, such as the difference between high and low scorers or the slope of the line in the examples above, will also be biased if the r or sd from the reliability study is substantially different from what your subjects would show in a reliability study with the same time between tests as in your study.

Use a Control Group
Using a control group is a better approach than correcting the change score. Actually, the approaches are fundamentally the same, because the control group is effectively the most appropriate reliability study for correcting the change scores. But don't use the control group to correct each subject's change score. Instead, analyze the effect of the pre-test score on the change score in both groups in the same manner, then compare the effect in the treatment group with that in the control group. The analysis will require a two-way analysis of variance (ANOVA) or covariance (ANCOVA). For example, suppose Ychng is the dependent variable representing each subject's post-pre change score, suppose Group has levels control and intervention, and suppose Prescore represents the pre-test score. The model is:

      Ychng <= Group Prescore Group*Prescore.

If Prescore has the numeric values of the pre-test score, the model represents an ANCOVA. If instead you have coded the pre-test scores into two levels, such as low and high, the model is a 2-way ANOVA. Not that it matters what you call it–either way, you are interested only in the interaction term Group*Prescore, which yields the difference between the groups in the effect of the pre-test score on the change score (that is, on the effect of the treatment).
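For the ANCOVA form of the model, the point estimate of the Group*Prescore interaction is the difference between the two groups in the slope of Ychng on Prescore. The sketch below computes only that point estimate from simulated data; it gives no confidence limits or p value, for which you would fit the model in a statistics package. The simulation assumes a uniform treatment effect of 3 units, and the function names are hypothetical.

```python
import random
import statistics


def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate_study(n_per_group=10000, noise_sd=5.0, gain=3.0, seed=5):
    """Simulated Prescore, Ychng and Group; treatment adds `gain` to everyone."""
    rng = random.Random(seed)
    prescore, ychng, group = [], [], []
    for name, effect in (("control", 0.0), ("intervention", gain)):
        for _ in range(n_per_group):
            true_score = rng.gauss(50.0, 10.0)
            pre = true_score + rng.gauss(0.0, noise_sd)
            post = true_score + effect + rng.gauss(0.0, noise_sd)
            prescore.append(pre)
            ychng.append(post - pre)
            group.append(name)
    return prescore, ychng, group


def prescore_effects(prescore, ychng, group):
    """Per-group slopes of Ychng on Prescore and their difference."""
    slopes, means = {}, {}
    for name in ("control", "intervention"):
        x = [p for p, g in zip(prescore, group) if g == name]
        y = [c for c, g in zip(ychng, group) if g == name]
        slopes[name] = slope(x, y)
        means[name] = statistics.fmean(y)
    return {
        "slopes": slopes,
        # Point estimate of the Group*Prescore interaction.
        "interaction": slopes["intervention"] - slopes["control"],
        # Raw difference between groups in mean change score.
        "mean_difference": means["intervention"] - means["control"],
    }


if __name__ == "__main__":
    result = prescore_effects(*simulate_study())
    print(result)
```

Both simulated groups show the same artifactual negative slope, so the interaction is close to zero, which is the right answer here: the treatment effect was set to be the same for all subjects.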

Watch out for non-uniform error! The standard deviation of the change scores in the treatment group may be larger than that in the control group, and there may be differences in the standard deviation for different values of Prescore, when there is a substantial true effect of pre-test score on the change score. The only way to take such non-uniform error into account properly is to use mixed modeling to specify different error terms for the different groups. Sorry, that's the way it is, guys. It's time you upskilled to the mixed model.

Last updated 26 June 06