A New View of Statistics
Generalizing to a Population:
SAMPLE SIZE ON THE FLY continued

ON THE FLY: MISCELLANEOUS

On this last page devoted to sample size on the fly, I explain how to use it for any design and any outcome statistic. I then suggest what to say to the ethical committee when you apply for approval. I also warn you not to use statistical significance for sampling on the fly.

ON THE FLY FOR OTHER DESIGNS

Whatever the design and whatever the outcome statistic, if your stats program can produce a confidence interval for the outcome statistic, you can sample on the fly. Here is the procedure. First I explain how to do it for outcome statistics whose confidence interval has a width inversely proportional to the square root of the sample size.

1. Decide on an acceptable width for the confidence interval of your outcome statistic. If the outcome statistic is a correlation coefficient or a frequency difference, there's no problem: the acceptable widths are 0.20 for a correlation and 20% for a frequency difference. Or you can choose a narrower confidence interval for the frequency difference, if it's a matter of life and death.
2. If the outcome is an effect size, the width depends on the value of the effect size, as shown in the figure on the page devoted to differences between means.
3. For other outcome statistics, work out what seems like a reasonable acceptable width for its confidence interval. It may depend on the magnitude of the statistic. For example, the relative risk and odds ratio clearly need wider confidence intervals for larger values of the statistic.
4. Start with a reasonable sample size. If it's a cross-sectional design, it will probably be around 50 subjects. If it's a longitudinal design and the outcome is derived from the repeated measure, then 10 or so will probably do the trick, provided the reliability isn't too bad.
5. The rest will sound familiar! I've copied it from the method for means in longitudinal studies.
6. Do the practical work.
7. Calculate the value of the outcome statistic and its confidence interval.
8. If your observed confidence interval is narrower than the acceptable confidence interval, the study is finished. If not, go to the next step.
9. Divide the width of your observed confidence interval by the acceptable width, square the result, then multiply it by the total number of subjects you have tested. That's your next target total number of subjects.
10. Subtract the current total sample size from that target total. The result is the extra subjects for the next lot of practical work.
11. Do the practical work, add the data to the previous data, then go to Step 7.
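Steps 6-11 can be sketched in code. Here is a minimal Python sketch for a correlation coefficient, with a 95% confidence interval from the Fisher z transformation (its width shrinks roughly as one over the square root of the sample size). The function `collect_subjects` is a hypothetical stand-in for the practical work: it simulates data with an assumed true correlation of 0.6.

```python
import math
import random

def collect_subjects(n, true_r=0.6, rng=random):
    # Hypothetical stand-in for "do the practical work" (Steps 6 and 11):
    # simulate n (x, y) pairs from a bivariate normal with an assumed
    # true correlation of 0.6.
    s = math.sqrt(1 - true_r ** 2)
    out = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        out.append((x, true_r * x + s * rng.gauss(0, 1)))
    return out

def pearson_r(pairs):
    # Sample correlation coefficient.
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

def ci_width(r, n, z=1.96):
    # Width of the 95% confidence interval for a correlation via the
    # Fisher z transformation; it shrinks roughly as 1/sqrt(n).
    half = z / math.sqrt(n - 3)
    return math.tanh(math.atanh(r) + half) - math.tanh(math.atanh(r) - half)

def sample_on_the_fly(acceptable=0.20, start_n=50, max_rounds=10):
    data = collect_subjects(start_n)                  # Step 6
    for _ in range(max_rounds):
        r = pearson_r(data)                           # Step 7
        width = ci_width(r, len(data))
        if width <= acceptable:                       # Step 8: done
            return r, len(data)
        # Step 9: (observed width / acceptable width)^2 times current n
        target = math.ceil(len(data) * (width / acceptable) ** 2)
        data += collect_subjects(target - len(data))  # Steps 10-11
    return pearson_r(data), len(data)
```

With these defaults the loop usually settles within a round or two of extra sampling, at roughly 100-250 subjects depending on where the sample correlation lands.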

If the confidence interval of your outcome statistic is not inversely proportional to the square root of the sample size, replace Step 9 with the following elegant procedure (which allows you to work out the relationship between sample size and the width of the confidence interval):

1. Make a double-sized sample by duplicating the sample and adding the copy back in with the original.
2. Analyze the double-sized sample with the stats program to get the confidence interval.
3. Add the new sample to itself to get a sample four times as big, then analyze it for the confidence interval.
4. Repeat to analyze a sample eight times as big, and 16 times as big.
5. Now plot the confidence interval against sample size, connect the points with a smooth curve, and read off the sample size corresponding to the acceptable confidence interval for the value of the outcome statistic from Step 7. Then go to Step 10.
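The duplicate-and-reanalyze procedure can be sketched as follows. The hypothetical `ci_width_fn` stands in for the stats program (it returns the width of the confidence interval for a given data set), and interpolating on log scales stands in for reading the sample size off a smooth curve by eye.

```python
import math

def sample_size_for_width(sample, ci_width_fn, acceptable, doublings=4):
    # Duplicate the sample repeatedly (Steps 1-4), recording the
    # confidence-interval width at each size, then interpolate between
    # the points to find the sample size giving the acceptable width
    # (Step 5). ci_width_fn(data) stands in for "analyze with the
    # stats program".
    data = list(sample)
    points = [(len(data), ci_width_fn(data))]
    if points[0][1] <= acceptable:
        return len(data)  # the current sample is already big enough
    for _ in range(doublings):
        data = data + data  # duplicate the sample and add it back in
        points.append((len(data), ci_width_fn(data)))
    # Interpolate log(width) against log(n) between bracketing points
    # (a smooth-curve assumption in place of plotting by eye).
    for (n1, w1), (n2, w2) in zip(points, points[1:]):
        if w2 <= acceptable <= w1:
            t = (math.log(w1) - math.log(acceptable)) / (math.log(w1) - math.log(w2))
            return round(math.exp(math.log(n1) + t * (math.log(n2) - math.log(n1))))
    return None  # acceptable width not reached even at 16x the sample
```

For example, with a toy width function of 10/sqrt(n) and a sample of 25 subjects, the estimated sample size for an acceptable width of 0.5 is 400.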

ON THE FLY FOR THE ETHICAL COMMITTEE

You need to convince the ethical committee that you have the resources to go to the usual large number of subjects, if the effect turns out to be small. So you will have to provide an estimate of the worst-case sample size. You'll have to justify it using my approach with confidence intervals (which requires half the usual number), because you can't let statistical significance get anywhere near sample size on the fly. The two do not mix, as we'll see shortly.

To do a cross-sectional study properly, you must have the resources to test hundreds of subjects, if necessary. Don't forget to take into account known or guessed validities, which could push the number up by a factor of two or three.

For a longitudinal study, reliability is crucial for calculating how many subjects you might need. If you don't know or can't guess the reliability, you have to tell the committee that you simply don't know how many subjects you might end up with. So tell them that testing 10 or so subjects per group will be enough to detect large effects if the reliability is almost perfect, and that it will give you enough data to estimate the final sample size roughly otherwise. Indicate the total number you will be able to test, and admit that this number may not be enough if the reliability turns out to be low.

In that case you will end up with a confidence interval that is wider than optimum, but the result may still be publishable. There's nothing you can do about it, and there's no ethical justification for your application to be refused, if you've got everything else right. After all, if no-one knows the reliability, someone has to start testing to find out how many subjects are needed. And it makes sense to do that during the experiment itself rather than waste resources on a separate reliability study. But if you already have data from a reliability study, point out that uncertainty in the reliability makes a big difference to the estimate of the worst-case final sample size, so your estimate might still be wrong.

DO NOT FLY WITH STATISTICAL SIGNIFICANCE

It's important to understand that you sample until you get a narrow confidence interval. You do NOT sample until you get statistical significance. Let's see why.

If statistical significance is your goal, you would presumably start with a sample big enough to give statistical significance for large effects. For example, you might start searching for a correlation of 0.6, which you would want to find statistically significant (p<0.05) 80% of the time. From the formulae, the number of subjects is 13, so let's say you start with this number. If you get statistical significance, you stop. If not, you test more subjects.

Seems OK, but there are two things wrong. If the correlation does turn out to be statistically significant on the first go, it has such a wide confidence interval that the correlation in the population is likely to be anything from practically perfect down to trivial. In other words, there's an effect, yes, but you end up with little idea of how big it is.

The other problem is more serious: bias! With a true correlation of 0.6, a starting sample size of 13, and up to three rounds of extra sampling, the sample correlation ends up at 0.65 on average. For a true correlation of 0.40, the sample correlation averages 0.50. This amount of bias is unacceptable. Starting with a bigger sample helps, but as long as you make stopping contingent upon statistical significance, you will have substantial bias for most values of the correlation. For example, a true correlation of 0.20 and a starting sample of 45 produce a correlation of 0.25 on average in the final sample. You could start with hundreds of subjects, I suppose, but by then you'd have defeated the purpose of sample size on the fly!
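The bias is easy to reproduce by simulation. Here is a minimal sketch, assuming 13 extra subjects per round of extra sampling (an assumption; the text specifies only the number of rounds) and a Fisher z approximation for the significance test.

```python
import math
import random

def pearson(pairs):
    # Sample correlation coefficient.
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

def one_experiment(true_r, start_n, extra_rounds, add_n, rng):
    # Draw start_n subjects; stop and report the correlation as soon as
    # it is "significant" (Fisher z approximation, |z| > 1.96),
    # otherwise add add_n more subjects, up to extra_rounds times.
    def draw(n):
        s = math.sqrt(1 - true_r ** 2)
        out = []
        for _ in range(n):
            x = rng.gauss(0, 1)
            out.append((x, true_r * x + s * rng.gauss(0, 1)))
        return out
    data = draw(start_n)
    for _ in range(extra_rounds):
        r = pearson(data)
        if abs(math.atanh(r)) * math.sqrt(len(data) - 3) > 1.96:
            return r              # significant: stop and report
        data += draw(add_n)
    return pearson(data)          # out of rounds: report anyway

rng = random.Random(0)
reps = 2000
mean_r = sum(one_experiment(0.6, 13, 3, 13, rng) for _ in range(reps)) / reps
```

Averaging the reported correlations over many simulated experiments gives a `mean_r` above the true value of 0.6: stopping on significance selectively locks in the high sample correlations.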

I wonder if sampling on the fly using statistical significance is a widespread practice, without people realizing it. By people I mean everyone, including the experimenters themselves. It's all too easy to start a study with a small sample, stop if you get statistical significance, and otherwise test a few more subjects to bring a promising p value below the 0.05 threshold!

A FINAL WARNING. Opting for sample size on the fly, then sky diving as soon as you get statistical significance, is forbidden. If your paper comes to me for review, I will reject it on the grounds that the result is biased and that the confidence interval is too wide.