New View of Statistics: On The Fly for Correlations

A New View of Statistics

© 1997 Will G Hopkins


Generalizing to a Population:
SAMPLE SIZE ON THE FLY continued

ON THE FLY FOR CORRELATIONS

The research question here is simply this: how linear is the relationship between two numeric variables, like weight and height? The extent of the linearity is captured beautifully by the correlation coefficient, so that's the outcome statistic we focus on.

As I explained on the previous page, to do the research on the fly you keep sampling until the width of the confidence interval for the correlation falls below 0.20. Here's how to go about it.

1. What's the maximum the correlation could ever be in the population you are studying? Start with a sample size that would give a confidence interval of 0.20 for that correlation. Use the graph below to read off this sample size. (The graph is just an adaptation of the figure on the previous page, to allow you to get the sample size corresponding to any correlation.)

Curve is fitted to empirically-derived sample sizes that give confidence intervals of 0.20 for correlations in the "middle" of the steps of the magnitude scale. More information and simulation program.

The smallest sample size is about 45, which corresponds to correlations of 0.82 or higher. Correlations of 0.90 or more are a special case I'll deal with separately.

2. Do the practical work and calculate the correlation for the initial sample.
3. If the observed correlation is higher than the correlation corresponding to the initial sample size, the confidence interval must be less than 0.20, so the study is finished. If not, go to the next step.
4. Use the graph to read off the sample size that would give your correlation a confidence interval of 0.20.
5. Subtract the current total sample size from that sample size on the graph. The result is the number of subjects for the next lot of practical work.
6. Do the practical work, add the new observations to all the previous ones, then calculate the correlation for the whole lot.
7. If the correlation is higher than the previous correlation, the confidence interval must be less than 0.20. The study is finished. Otherwise go to Step 4.

Here's an example. You want to find the correlation between height and weight in a population. You think it will be very large, so you start with 45 subjects. You get a correlation of 0.71. The graph shows the corresponding sample size is about 95. So sample another 50 subjects (= 95 - 45), then calculate the correlation for all 95. You get 0.67, which means about 120 subjects. Off you go, test another 25. This time the correlation for all 120 subjects is 0.69. Stop. Publish.
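If you don't have the graph to hand, the lookup and the stopping rule can be sketched in Python. This is only a rough stand-in, not the empirically fitted curve: it assumes the Fisher z approximation for the confidence interval (described in the next section), so the sample sizes it returns can differ by a few subjects from those read off the graph. The function names `ci_width`, `sample_size_for` and `next_round` are made up for illustration.

```python
import math

def ci_width(r, n, z_crit=1.96):
    """Width of the 95% confidence interval for a correlation r
    observed in n subjects, using the Fisher z approximation."""
    z = math.atanh(r)                      # Fisher z transformation
    half = z_crit / math.sqrt(n - 3)       # half-width on the z scale
    return math.tanh(z + half) - math.tanh(z - half)

def sample_size_for(r, target_width=0.20):
    """Smallest n whose interval is no wider than target_width.
    Assumption: this replaces reading the value off the graph."""
    n = 4
    while ci_width(abs(r), n) > target_width:
        n += 1
    return n

def next_round(r_observed, n_so_far, target_width=0.20):
    """Extra subjects to test on the next round.
    A result of 0 means the interval is already narrow enough."""
    return max(0, sample_size_for(r_observed, target_width) - n_so_far)

# The example above: start with 45, observe 0.71, later observe 0.69 with 120.
print(sample_size_for(0.82))   # initial sample size for a very large correlation
print(next_round(0.71, 45))    # subjects for the second round
print(next_round(0.69, 120))   # 0, so stop
```

Stopping when `next_round` returns 0 is the same rule as Step 7: if the new correlation is higher than the previous one, the sample size it requires is no bigger than the one you already have.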

The chance that you will finish on each round after the initial one is 50% or more, so the chance of having to go more than three extra rounds is about 10% or less (0.5 x 0.5 x 0.5 is about 0.12). By then, my simulations show that typically you're adding only 5% to the total number of subjects, so you'll converge rapidly on the final correlation.

Confidence Limits for the Correlation

Naturally, you're expected to give the confidence limits of the correlation coefficient you end up with. If your stats program doesn't generate them, you'll have to use the Fisher z transformation: z = 0.5 log[(1 + r)/(1 - r)], where log is the natural logarithm. The transformed correlation (z) is approximately normally distributed with variance 1/(n - 3), so the 95% confidence limits are given by z ± 1.96/sqrt(n - 3). You then have to back-transform each of these limits to a correlation coefficient using the equation r = (e^(2z) - 1)/(e^(2z) + 1). This is standard stuff for statisticians, but as a mere mortal you'll be struggling. I've set it up on the spreadsheet for confidence limits.
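If you'd rather not use a spreadsheet, the same calculation can be sketched in a few lines of Python. The function name `correlation_confidence_limits` and its arguments are illustrative labels only, not part of any stats package.

```python
import math

def correlation_confidence_limits(r, n, z_crit=1.96):
    """Approximate 95% confidence limits for a correlation r
    from n subjects, via the Fisher z transformation."""
    z = 0.5 * math.log((1 + r) / (1 - r))   # transform r to z
    half_width = z_crit / math.sqrt(n - 3)  # 1.96 standard errors of z

    def back_transform(zv):
        # r = (e^(2z) - 1) / (e^(2z) + 1)
        e = math.exp(2 * zv)
        return (e - 1) / (e + 1)

    return back_transform(z - half_width), back_transform(z + half_width)

lower, upper = correlation_confidence_limits(0.91, 15)
print(f"lower {lower:.2f}, upper {upper:.2f}")
```

With r = 0.91 and n = 15 this gives limits of about 0.75 and 0.97. Note that the limits are not symmetrical about r; that's a consequence of the back-transformation.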

More on the Initial and Final Sample Sizes

You will be tempted to start with 45 every time, hoping that you won't have to do any more. But funnily enough, starting with this small sample, you could end up testing more subjects than necessary! For example, if the correlation in the population is moderate (~0.4), a sample of 45 will sometimes produce a small correlation (~0.2), and when that happens you're supposed to test about 300 subjects on the next round. But if you had opted for, say, 200 to start with, you'd be unlikely to have to test another 150 on the next round.

But there's an acceptable cheat's way around this problem that allows you to start with 45 every time. All you do is set an upper limit on the number of subjects you will test on any one round. For example, start with 45 subjects, but if the next round requires 250 more, you test only 100. Then you work out how many more you need from the correlation for the total of 145, and test them.
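The capping rule is simple enough to write down as a one-line function. Here is a minimal sketch in Python; the name `capped_batch` is hypothetical, and `n_required` stands for whatever total sample size you read off the graph for your current correlation.

```python
def capped_batch(n_so_far, n_required, cap):
    """Number of subjects to test on the next round, never more than cap.
    n_required is the total sample size suggested by the graph."""
    return min(max(0, n_required - n_so_far), cap)

# The example above: 45 tested so far, the graph asks for 250 more (295 in total),
# but you limit the round to 100 subjects.
n = 45
batch = capped_batch(n, n_required=295, cap=100)
n += batch
print(batch, n)  # 100 145
```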

However you do it, you'll get there in the end. And the answer will be trustworthy: I've found that the greatest bias occurs for correlations around 0.7-0.8, but it is only 0.01. This amount of bias--5% of the confidence interval--is negligible. What's more, the bias is insensitive to the initial sample size, and there is no noticeable extra bias when you set reasonable limits to the sample size on each extra round of sampling (e.g. 100 on the first round, 200 on the second and/or higher rounds). So even if you haven't got the resources to go to the full 400 subjects, you can still get a practically unbiased estimate of the correlation, albeit with a less-than-ideal confidence interval for the smallest correlations.

Adjusting for Imperfect Validity

Imperfect validity of one or both variables in the correlation degrades the apparent relationship between them. If the correlation you're chasing has a true value of r, and the validities are v and w, then the correlation you will observe, say r', is r·v·w, which is smaller than r. But when you write up the study, you will say that the correlation in the population is r'/(v·w). In other words, you inflate the observed correlation by a factor 1/(v·w), which is, of course, greater than 1. Uh huh! So that means the confidence interval is also inflated by the same factor. Curses, that means we'll need more subjects to make sure the larger correlation still has a confidence interval of 0.20. In fact, the final number of subjects is inflated by a factor 1/(v²w²). This factor popped up in the estimation of sample size using the traditional approach. You can use it on the fly, but it's a bit tricky. You have to inflate all sample sizes by the same factor on the way to detecting the correlation.

Here's an example. Suppose the validity correlations are 0.90 and 0.80. Overall that's 0.72, and 0.72² is 0.52. So start with 45/0.52 or 87 subjects. Suppose you get a correlation of 0.35. For perfect validity that would be a correlation of 0.35/0.72 or 0.49. On the graph that's equivalent to 220 subjects, but that's for perfect validity, so you need 220/0.52 or 423 subjects. So test 423 - 87 = 336 subjects. And so on. Mind-boggling, I'm afraid. It's all much simpler if you use the spreadsheet!
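If the spreadsheet isn't handy, the two adjustments in this example can be sketched in Python. The function names are hypothetical, and the sample sizes for perfect validity (45 and 220) are taken from the example above rather than computed. The sketch uses the unrounded factor 0.72² = 0.5184 and rounds up, so the second sample size comes out a subject or two higher than the 423 you get with the rounded factor 0.52.

```python
import math

def adjust_for_validity(r_observed, v, w):
    """Correlation corrected for imperfect validity: r'/(v*w)."""
    return r_observed / (v * w)

def inflate_sample_size(n_perfect, v, w):
    """Sample size for perfect validity inflated by 1/(v^2 * w^2),
    rounded up to a whole subject."""
    return math.ceil(n_perfect / (v * w) ** 2)

v, w = 0.90, 0.80
n_start = inflate_sample_size(45, v, w)         # initial sample
r_true = adjust_for_validity(0.35, v, w)        # corrected correlation
n_total = inflate_sample_size(220, v, w)        # 220 read off the graph for r_true
print(n_start, round(r_true, 2), n_total, n_total - n_start)
```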

Nearly Perfect Correlations

You'll notice I've omitted correlations in the nearly perfect range on the graph for estimating sample sizes. If a correlation is this high, the relationship it represents is probably a reliability or a validity, or it may be a linear relationship used for predicting something. Confidence intervals less than 0.20 are needed for these correlations. Exactly how much less is a difficult question that I'm still working on.

Meanwhile, start with a sample of about 15 and see what you get for the correlation and for its confidence limits. You'll almost certainly find that the lower confidence limit is too low, unless you're lucky enough to get a correlation of 0.98 or 0.99. So you'll need more subjects. Estimate the sample size for the next round using the rule that the width of the interval is approximately inversely proportional to the square root of the sample size. Then test the extra subjects, recalculate the correlation and its confidence limits, and go to another round if necessary.

For example, let's suppose you get a correlation of 0.91 with 15 subjects. The 95% confidence limits are 0.97 and 0.75. Well, if the correlation is really 0.97, that's great for every possible purpose. But 0.75 is hopeless for applications requiring an almost perfect correlation! Obviously you need to narrow down the confidence interval. Halving the interval would help, which means a total of 4x as many subjects, or another 45. Test them, add them to the original 15, then recalculate. Suppose you get 0.93. The 95% confidence limits are now 0.96 and 0.89. Whether you stop at this point or go to another round of testing depends on whether 0.89 makes a big difference compared with 0.96, for the application you have in mind. I'd stop there if I was defining the validity of a variable for the purpose of seeing how many extra subjects I might need in a big cross-sectional study. I'd want to narrow down the interval a bit more if I wanted to use the underlying linear relationship to predict things like body fat from skinfold thickness. And I'd probably want to narrow it down more if the correlation was a reliability I was using to predict a sample size in a longitudinal study, using the old-fashioned approach.
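The square-root rule used in this example can be sketched as follows. It is only the rough rule of thumb stated above (interval width inversely proportional to the square root of the sample size), and the function name `total_n_for_target_width` is made up for illustration.

```python
import math

def total_n_for_target_width(n_current, width_current, width_target):
    """Total sample size needed to shrink the confidence interval from
    width_current to width_target, assuming width is proportional
    to 1/sqrt(n)."""
    return math.ceil(n_current * (width_current / width_target) ** 2)

# The example above: limits of 0.75 and 0.97 with 15 subjects, a width of 0.22.
# Halving the width to 0.11 needs four times as many subjects.
n_total = total_n_for_target_width(15, 0.22, 0.11)
print(n_total, n_total - 15)  # 60 45
```

Because the rule is approximate, and the correlation itself will shift when you add subjects, you should recalculate the confidence limits after each round rather than trust the predicted width.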

For another example, imagine that you got a correlation of 0.98 with your initial sample of 15. The confidence limits are 0.94 and 0.99. No need to test any more subjects!
