New View of Statistics: Reliability Calculations

A New View of Statistics

© 2000 Will G Hopkins

Summarizing Data:
PRECISION OF MEASUREMENT continued

CALCULATIONS FOR RELIABILITY
Make sure you understand the page on reliability before tackling this page. I explain here how to analyze data for two trials using simple but effective methods. To combine three or more trials you need more sophisticated procedures, such as analysis of variance or modeling variances. I go into heaps of detail about checking for non-uniform error in your data, and I have a few words on biased estimates of reliability. Finally, you can download a spreadsheet for calculating reliability between consecutive pairs of trials, complete with raw and percent estimates and confidence limits for typical error, change in mean, and retest correlation. The spreadsheet has data adapted from real measurements of skinfold thickness of athletes.

Two Trials
Analyzing two trials is straightforward. All the necessary calculations are included in the spreadsheet for reliability. When you have three or more trials, I strongly recommend that you first do separate analyses for consecutive pairs of trials (Trial1 with Trial2, Trial2 with Trial3, Trial3 with Trial4, etc.). That way you will see if there are any substantial differences in the typical error or change in the mean between pairs of trials. Such differences are indicative of learning or practice effects. If there is no substantial change in the typical error between three or more consecutive trials, analyze those trials all together to get greater precision for your estimates of reliability.

Typical Error
The values of the change score or difference score for each subject yield the typical error. Simply divide the standard deviation of the difference score by √2. For example, if the difference scores are 5, -2, 6, 0, and -3, the standard deviation of these scores is 4.1, so the typical error is 4.1/√2 = 2.9. This method for calculating the typical error follows from the fact that the variance of the difference score (s²_diff) is equal to the sum of the variances representing the typical error (s) in each trial: s²_diff = s² + s², so s = s_diff/√2.
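The calculation above can be sketched in a few lines of Python. This is only an illustration: the function name and the trial values are made up, chosen so that the difference scores come out as 5, -2, 6, 0, and -3.

```python
import math
import statistics

def typical_error(trial1, trial2):
    """Typical error = SD of the difference scores divided by root 2."""
    diffs = [b - a for a, b in zip(trial1, trial2)]
    return statistics.stdev(diffs) / math.sqrt(2)

# Hypothetical trial values giving difference scores of 5, -2, 6, 0, -3
trial1 = [50, 62, 47, 55, 60]
trial2 = [55, 60, 53, 55, 57]
print(round(typical_error(trial1, trial2), 1))  # 2.9
```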

To derive this within-subject variation as a coefficient of variation (CV), log-transform your variable, then do the same calculations as above. The CV is derived from the typical error (s) of the log-transformed variable via the following formula:
CV = 100(e^s - 1),
which simplifies to 100s for s<0.05 (that is, CVs of less than 5%). You will also meet this formula on the page about log-transformation, where I describe how to represent the standard deviation of a variable that needs log transformation to make it normally distributed. As I describe on that page, I find it easier to interpret the standard deviation and shifts in the mean if I make the log transformation 100x the log of the variable. That way the typical error and shifts in the mean are already approximately percents. To convert them to exact percents, the formula becomes 100(e^(s/100) - 1).
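A minimal sketch of the CV calculation, assuming natural logs and two trials per subject. The function names are hypothetical. The second function handles a typical error that was obtained from the 100x-log version of the variable.

```python
import math
import statistics

def cv_percent(trial1, trial2):
    """CV (%) from the typical error of the natural-log-transformed variable."""
    diffs = [math.log(b) - math.log(a) for a, b in zip(trial1, trial2)]
    s = statistics.stdev(diffs) / math.sqrt(2)   # typical error in log units
    return 100 * (math.exp(s) - 1)

def exact_percent_from_100log(s100):
    """Convert a typical error (or shift) in 100*log units to an exact percent."""
    return 100 * (math.exp(s100 / 100) - 1)
```

For small values the conversion changes little: a typical error of 2.0 in 100x-log units converts to about 2.02%.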

We sometimes show the typical error with a ± sign in front of it, to indicate that a subject's observed value varies by typically ± the typical error whenever we measure it. For example, the typical error in a monthly measurement of body mass might be ±1.5 kg. When we express the typical error as a CV, we can also think of it as ±2.1% (if the subject weighed 70 kg), but strictly speaking it's more appropriate to show the variation as ×/÷1.021. In other words, from month to month the body mass is typically high by a factor of 1.021 or low by a factor of 1/1.021. These factors come from the assumption that the log-transformed weight rather than the weight itself is normally distributed. Now, ×1.021 is the same as 1 + 0.021, and 1/1.021 is almost exactly 1 - 0.021, so it's OK to show the CV as ±2.1%. But when the CV is bigger than 5% or so, the use of the minus sign gets more inaccurate. For example, if the CV is 35%, the value of the variable varies typically by a factor of 1.35 to 1/1.35, or 1.35 to 0.74, or 1 + 0.35 to 1 - 0.26, which is certainly not the same as 1 + 35% to 1 - 35%. You can still write ±35%, but be aware that the implied typical variation in the observed value is ×/÷1.35.
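The arithmetic behind the ×/÷ factors is easy to check. The helper below is a hypothetical illustration, not part of any spreadsheet.

```python
def cv_as_factors(cv_percent):
    """Return the multiply and divide factors implied by a CV given in percent."""
    factor = 1 + cv_percent / 100
    return factor, 1 / factor

up, down = cv_as_factors(35)
print(round(up, 2), round(down, 2))  # 1.35 0.74
```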

Changes in the Mean
A simple way to get these is to do paired t tests between the pairs of trials. Do it on the log-transformed variable and you'll get approximate percent changes in the mean between trials. Use the same formulae as for the CV to turn these into exact percent changes.
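As a rough sketch, the change in the mean and its confidence limits can be computed from the difference scores of the 100x-log-transformed variable, which is equivalent to a paired t test. The function name and data are hypothetical, and the critical t value has to be supplied by you (for example 2.776 for 95% limits with 4 degrees of freedom), because the Python standard library has no t distribution.

```python
import math
import statistics

def percent_change_in_mean(trial1, trial2, t_crit):
    """Exact percent change in the mean, with lower and upper confidence limits."""
    diffs = [100 * (math.log(b) - math.log(a)) for a, b in zip(trial1, trial2)]
    mean = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))  # SE of the mean change

    def to_exact(x):
        # Same conversion as for the CV: 100(e^(x/100) - 1)
        return 100 * (math.exp(x / 100) - 1)

    return to_exact(mean), to_exact(mean - t_crit * se), to_exact(mean + t_crit * se)

# Hypothetical data: 5 subjects, so 4 degrees of freedom and t = 2.776 for 95% limits
change, lower, upper = percent_change_in_mean(
    [50, 62, 47, 55, 60], [55, 60, 53, 55, 57], t_crit=2.776)
```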

Retest Correlation
A simple Pearson correlation is near enough. If the variable is closer to normally distributed after log transformation, you should use the correlation derived from the log-transformed variable. Alternatively calculate the intraclass correlation coefficient from the formula ICC = (SD² - sd²)/SD², where SD is the between-subject standard deviation and sd is the within-subject standard deviation (the typical or standard error of measurement). These standard deviations can come from different subjects, if you want to estimate the retest correlation for one group by applying the error from a study of a different group. The spreadsheet for the ICC has this formula and confidence limits for the ICC.

Note that the above relationship allows you to calculate the typical error from a retest correlation, when you also know the between-subject standard deviation: sd = SD·√(1 - r). Strictly speaking the r should be the intraclass correlation, but there is so little difference between the Pearson and the ICC, even for as few as 10 subjects, that it doesn't matter.
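The two relationships are inverses of each other, as this small sketch shows. The function names and the numbers (SD = 10, sd = 3) are invented for illustration.

```python
import math

def icc_from_sds(between_sd, within_sd):
    """ICC = (SD^2 - sd^2) / SD^2."""
    return (between_sd**2 - within_sd**2) / between_sd**2

def typical_error_from_r(between_sd, r):
    """sd = SD * root(1 - r)."""
    return between_sd * math.sqrt(1 - r)

r = icc_from_sds(10, 3)                        # 0.91
print(round(typical_error_from_r(10, r), 2))   # 3.0
```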

Three or More Trials
I deal here with the procedures for getting the average reliability across three or more trials. The simplest and possibly the most practical or realistic procedure is simply to average the reliability for the consecutive pairs of trials. Well, it's not that simple to average the standard deviations representing the typical error, because you have to weight their squares by the degrees of freedom, then take the square root. I've done it for you in the reliability spreadsheet. The resulting average is the typical error you would expect for the average time between consecutive pairs of trials, and you usually make that the same (e.g., 1 week) when you design the reliability study.
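The weighted averaging described above might look like this in Python. This is a sketch, not the spreadsheet's own code. It assumes the degrees of freedom for each pair of trials are the number of subjects in that pair minus one.

```python
import math

def average_typical_error(typical_errors, dfs):
    """Pool typical errors: weight their squares by degrees of freedom, then take the root."""
    pooled_variance = sum(df * s**2 for s, df in zip(typical_errors, dfs)) / sum(dfs)
    return math.sqrt(pooled_variance)

# Hypothetical typical errors for trials 1-2 and 2-3, each from 10 subjects (9 df)
print(round(average_typical_error([3.0, 4.0], [9, 9]), 2))  # 3.54
```

Note that the result (3.54) is a little larger than the simple mean of 3 and 4, because the averaging is done on the variances.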

There are more complicated procedures for getting the average reliability, using ANOVA or repeated-measures analyses. There is no spreadsheet for these procedures. I'll describe the usual approach, which is based on the assumption that there is a single random error of measurement that is the same for every subject for every trial. That is, whenever you take a measurement, a random number comes out of a hat and gets added to the true value. The numbers in the hat have a mean of zero, and their standard deviation is the error of measurement that you want to estimate. Or to put it another way, no matter which pairs of trials you select for analysis, either consecutive (e.g., 2+3) or otherwise (e.g., 1+4), you would expect to get the same error of measurement. This assumption may not be particularly realistic, if, for example, you did 5 trials each one week apart: the error of measurement between the first and last trial is likely to be greater than between trials closer together. If you estimate the error assuming it is the same, you will get something that is too large for trials close together and too small for trials further apart.

To understand this section properly, read the pages on statistical modeling. In a reliability study or analysis, you are asking this question: how well does the identity of a subject predict the value of the dependent variable, when you take into account any shift in the mean between tests? (If the variable is reliable, the value of the variable is predicted well from subject to subject. If the variable is unreliable, it isn't much help to know who the subject is.) So the model is simply:

dependent variable <= subject test

In other words, it's a two-way analysis of variance (ANOVA) of your variable with subject and test as the two effects. Do NOT include the interaction term in the model! The analysis is not done as a repeated-measures ANOVA, because the subject term is included in the model explicitly. Experts with the Statistical Analysis System can use a repeated-measures approach with mixed modeling, as described below in modeling variances.

Typical Error
The root mean-square error (RMSE) in the ANOVA is a standard deviation that represents the within-subject variation from test to test, averaged over all subjects. If your stats package doesn't provide confidence limits for it, use the spreadsheet for confidence limits.
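If you have no stats package to hand, the two-way ANOVA without interaction can be worked out directly for balanced data. The sketch below assumes no missing values and uses hypothetical names. It returns the RMSE and the F value for the subject term.

```python
import math

def two_way_anova(data):
    """data[i][j] = value for subject i on test j (balanced, no missing values)."""
    n, k = len(data), len(data[0])
    grand = sum(map(sum, data)) / (n * k)
    subj_means = [sum(row) / k for row in data]
    test_means = [sum(row[j] for row in data) / n for j in range(k)]
    # Between-subject sum of squares
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    # Residual sum of squares after removing subject and test effects
    ss_err = sum(
        (data[i][j] - subj_means[i] - test_means[j] + grand) ** 2
        for i in range(n) for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))
    return {"rmse": math.sqrt(ms_err), "F_subject": (ss_subj / (n - 1)) / ms_err}

# Hypothetical data: 5 subjects, 2 tests
data = [[50, 55], [62, 60], [47, 53], [55, 55], [60, 57]]
print(round(two_way_anova(data)["rmse"], 1))  # 2.9
```

With only two tests the RMSE is the same as the typical error calculated from the difference scores.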

If you use a one-way ANOVA in which the only effect is subject, the RMSE will be contaminated by any change in the mean between trials. (In a two-way ANOVA, the test effect takes out any change in the mean.) The resulting RMSE represents the total error of measurement. You can also derive the total error by calculating each subject's standard deviation, squaring them, averaging them over all subjects, then taking the square root. This procedure works for two trials, too. I don't recommend total error as a measure of reliability, because you don't know how much of the total error is due to change in the mean and how much is due to typical error.
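The subject-by-subject route to the total error can be sketched as follows, assuming every subject has the same number of trials. The function name and data are hypothetical.

```python
import math
import statistics

def total_error(data):
    """data[i] = list of trial values for subject i.

    Square each subject's SD, average over subjects, then take the square root.
    """
    variances = [statistics.variance(row) for row in data]
    return math.sqrt(statistics.mean(variances))

# Hypothetical data: 5 subjects, 2 trials
data = [[50, 55], [62, 60], [47, 53], [55, 55], [60, 57]]
print(round(total_error(data), 1))  # 2.7
```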

Changes in the Mean
Your stats program should be able to give you confidence limits or p values for each consecutive pairwise comparison of means. If it gives you only the p values, convert these to confidence limits using the spreadsheet for confidence limits.

Shifts in the mean and typical error as percents are derived from analysis of the log-transformed variable. See the previous section for the formula.

Retest Correlation
Scrutinize the output from the ANOVA and find something called the F value for the subject term. The retest correlation, calculated as an intraclass correlation coefficient (ICC), is derived from this F value:

ICC = (F - 1)/(F + k - 1),

where k = (number of observations - number of tests)/(number of subjects - 1). In the case of no missing values, number of observations = (number of tests)·(number of subjects), so k is simply the number of tests. For example, a reliability study of gymnastic skill consisted of 3 tests on 10 subjects. There were 28 observations instead of 30, because two athletes missed a test each, so k = (28-3)/(10-1) = 2.78. The F ratio for subjects was 56. Reliability was therefore (56-1)/(56+2.78-1) = 0.95.
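The gymnastics example can be checked with a short function (the name is hypothetical):

```python
def icc_from_f(f_subject, n_observations, n_tests, n_subjects):
    """ICC = (F - 1)/(F + k - 1), with k adjusted for missing values."""
    k = (n_observations - n_tests) / (n_subjects - 1)
    return (f_subject - 1) / (f_subject + k - 1)

# 3 tests on 10 subjects, 28 observations, F for subjects = 56
print(round(icc_from_f(56, 28, 3, 10), 2))  # 0.95
```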

I used to have this formula in the spreadsheet for confidence limits, then I removed it for many years, thinking that people don't need it. Recently (2009) I've started expressing predictability of competitive athletic performance as an ICC, and I found I do need it and related formulae. So they're back, in their own spreadsheet for the ICC.

The ICC formula came from Bartko (1966), although he used sums of squares rather than F values. His formula for k when there are missing values is complex and appears not to be the same as the one I have given above. The random statement in Proc Glm of the Statistical Analysis System generates k, and I have found by trial and error that my formula gives the exact value.

Your stats program will give you p values for the subject term and the test term. The p value for subject is not much use. It tells you whether the ICC is statistically significantly different from zero, but that's usually irrelevant. The ICC is usually at 0.7-0.9 or more, so there's no way it could be zero. More important are the confidence limits for the ICC and for the typical error. The p value for test addresses the issue of overall differences between the means of the tests, but with more than two tests you should pay more attention to the significance of consecutive pairwise differences (to see where any learning effects fade out). I'd prefer you to show the confidence intervals for the differences, rather than the p values. If your stats program doesn't give confidence intervals, use the spreadsheet for confidence limits for the typical error, and the spreadsheet for the ICC for confidence limits for the ICC. By the way, stats programs don't provide a p value for the typical error, because there's no way it can be zero.

The typical error or root mean square error (RMSE) from one group of subjects can be combined with the between-subject standard deviation (SD) of a second group to give the reliability correlation for the second group. This approach is handy if you do repeated testing on only a few subjects to get the within-subject variation, but you want to see how that translates into a reliability correlation when you combine it with the SD from single tests on a lot more subjects. You simply assume that the within-subject variation is the same for both groups, then apply the formula that defines the reliability correlation:

ICC = (SD² - typical error²)/SD².

(This formula can be derived simply enough from the definition of correlation as the covariance of two variables divided by the product of their standard deviations.) The spreadsheet for the ICC deals with this scenario, too.
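One way to sketch that derivation is shown below. It assumes each observed score is a true value T plus an independent random error e with the same standard deviation (the typical error, sd) in both trials. The symbols X_1, X_2, T and e are introduced here for illustration only.

```latex
\begin{align*}
X_1 &= T + e_1, \qquad X_2 = T + e_2,
  \qquad \operatorname{Var}(e_1) = \operatorname{Var}(e_2) = \mathrm{sd}^2 \\
\operatorname{Var}(X_1) &= \operatorname{Var}(X_2)
  = \operatorname{Var}(T) + \mathrm{sd}^2 = \mathrm{SD}^2
  \quad\Rightarrow\quad \operatorname{Var}(T) = \mathrm{SD}^2 - \mathrm{sd}^2 \\
\operatorname{Cov}(X_1, X_2) &= \operatorname{Var}(T) = \mathrm{SD}^2 - \mathrm{sd}^2 \\
r &= \frac{\operatorname{Cov}(X_1, X_2)}
          {\sqrt{\operatorname{Var}(X_1)}\,\sqrt{\operatorname{Var}(X_2)}}
   = \frac{\mathrm{SD}^2 - \mathrm{sd}^2}{\mathrm{SD}^2}
\end{align*}
```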

For non-normal variables, your analyses in the main study are likely to be non-parametric. So it makes sense to derive a non-parametric reliability. Just do the ANOVA on the rank-transformed variable. The within-subject variation is hard to interpret, though.

Attention sport psychologists: if the repeated "tests" are simply the items of an inventory, the alpha reliability of the items (i.e., the consistency of the mean of the items) is (F - 1)/F.

For nominal variables (variables with categories as values rather than numbers), the equivalent of the ICC is the kappa coefficient. Your stats program should offer this option in the output for the procedure that does chi-squared tests or contingency tables.

Modeling Variances for Reliability
A reliability study is just an experiment without an intervention, so any method for analyzing an experiment will work for a reliability study. Modeling variances is one such method. In SAS, you model variances with Proc Mixed, using the model for simple repeated measures. The procedure produces the within variance and its confidence limits. It also produces the retest correlation as an intraclass correlation, but to get its confidence limits you'll have to use the spreadsheet for confidence limits. I don't know whether the other major stats programs have procedures like Proc Mixed for modeling variances.

Non-Uniform Error of Measurement
I've already introduced the concept of non-uniform error (heteroscedasticity) to describe the situation when some subjects are more reliable than others. You should always check whether your typical error is non-uniform, but you will need plenty of subjects to draw any definite conclusions. One good way to check is to calculate the typical error for different subgroups. Often the typical error varies with the magnitude of the variable, so try splitting your subjects into a top half and a bottom half and analyzing them separately. For the data on skinfold thickness in the spreadsheet for reliability, the typical errors of the bottom and top halves are 0.48 and 1.03 mm (not shown on the spreadsheet--you'll have to do it yourself). It certainly looks like subjects with a bigger sum of skinfolds have more variability, but with only 10 subjects in each half, there's a lot of uncertainty about just how big the difference really is.
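Splitting the subjects into halves can be done along these lines. This sketch assumes you rank subjects by the average of their two trials, which is one reasonable choice rather than something specified here. The function name and data are made up, and the data are not the skinfold measurements.

```python
import math
import statistics

def typical_error_by_half(trial1, trial2):
    """Typical error for the bottom and top halves, ranked by each subject's mean."""
    pairs = sorted(zip(trial1, trial2), key=lambda p: (p[0] + p[1]) / 2)
    mid = len(pairs) // 2

    def typical_error(half):
        diffs = [b - a for a, b in half]
        return statistics.stdev(diffs) / math.sqrt(2)

    return typical_error(pairs[:mid]), typical_error(pairs[mid:])

# Hypothetical data in which the larger values are also the more variable ones
trial1 = [10, 11, 12, 13, 50, 52, 54, 56]
trial2 = [10, 12, 12, 14, 50, 56, 54, 60]
bottom, top = typical_error_by_half(trial1, trial2)
```

Running the same function on the log-transformed values shows whether the difference between the halves shrinks when the error is expressed as a percent.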

Depending on the sample and the variable, you should also analyze the typical errors for subgroups differing in sex, athletic status, age group, and so on. You sometimes find that any differences in reliability between such groups arise mainly from differences in the magnitude of the variable; for example, if log transformation removes any non-uniformity of error related to the magnitude of the variable, you will probably find that the subgroups for sex, age or whatever now have the same percent typical errors.

A more statistical approach to checking for differences in the typical error between subjects is to look at the scatter of points in the plot of the two trials. The scatter at right angles to the line of identity should be the same wherever you are on the line (and for whatever subgroups). If there is more scatter at one end, the subjects at that end have a bigger typical error. It's often difficult to tell whether the scatter is uniform on the plot, especially when reliability is high, because the points are all too close to the line. An easier way is to plot the change score against the average of the two trials for each subject. I have provided such a plot on the spreadsheet. (It's not obvious even on this plot that the subjects with bigger skinfolds have more variability. Again, more subjects are needed.) I've also provided a complete analysis for the log-transformed variable. A uniform scatter of the change scores after log-transformation implies that the coefficient of variation (CV, or percent typical error) is the same for all subjects, and the analysis of the log-transformed variable provides the best estimate. Look at the plots of the difference scores and you will see that the scatter is perhaps a little more uniform after log transformation. When I analyzed the bottom and top halves of the log-transformed variable, I got CVs of 1.1% and 2.0%. These CVs are a little closer together than their corresponding raw typical errors, so it would be better to represent the mean typical error for the full sample as 1.7% rather than 0.83 mm. But really, you need more subjects...

When you analyze three or more trials using ANOVA or repeated measures, the equivalent of the difference scores is the residuals in the analysis, and the equivalent of the average of the two trials is the predicted values. The standard deviation of the residuals is the typical error, so if the residuals are bigger for some subjects (some predicteds), the typical error is bigger for those subjects. Try to coax your stats program into producing a plot of the residuals vs the predicteds. Click for more information about residuals and predicteds, and about bad residuals (heteroscedasticity).
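If your stats program won't cooperate, the residuals and predicteds for the two-way model (subject and test, no interaction) are simple to compute for balanced data. The sketch below assumes no missing values and uses a hypothetical function name. It returns the points you would plot.

```python
def residuals_and_predicteds(data):
    """data[i][j] = value for subject i on test j; returns (predicted, residual) pairs."""
    n, k = len(data), len(data[0])
    grand = sum(map(sum, data)) / (n * k)
    subj_means = [sum(row) / k for row in data]
    test_means = [sum(row[j] for row in data) / n for j in range(k)]
    points = []
    for i in range(n):
        for j in range(k):
            # Predicted value = subject effect + test effect around the grand mean
            predicted = subj_means[i] + test_means[j] - grand
            points.append((predicted, data[i][j] - predicted))
    return points

# Hypothetical data: 5 subjects, 3 tests
points = residuals_and_predicteds(
    [[50, 55, 54], [62, 60, 61], [47, 53, 50], [55, 55, 57], [60, 57, 59]])
```

Plot the residuals (second value of each pair) against the predicteds (first value) and look for a fan shape.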

Biased Estimates of Reliability
Some statisticians think mistakenly that reliability should be calculated with a one-way ANOVA, in which you leave out the term for the identity of the tests. The trouble is, a one-way ANOVA produces an estimate of retest correlation that is biased low for small samples, and it is even lower if the means differ between trials. The within-subject variation from the analysis is the same as the total error, which will be larger than the typical error when there is any systematic change in the mean between trials. Neither of these estimates of reliability should be used to estimate sample sizes for longitudinal studies.

The Pearson correlation coefficient is also a biased estimate of retest correlation: it is biased high for small sample sizes. For example, with only two subjects you always get a correlation of 1! For samples of 15 or more subjects, the ICC and the Pearson do not usually differ in the first two decimal places.

I used to think that limits of agreement were biased high for small samples, because I thought they were defined as the 95% confidence limits for a subject's change between trials. (The formula for confidence limits includes the t statistic, which is affected by sample size in such a way that the limits defined in this way would be biased high for small samples.) But apparently Bland and Altman, the progenitors of limits of agreement, did not define limits of agreement as 95% confidence limits; instead they defined them as a "reference range", generated by multiplying the typical error by 2.77, regardless of the size of the sample that is used to estimate the typical error. In other words, the limits of agreement represent 95% confidence limits for a subject's true change only if the typical error is derived from a large sample. With this definition, the limits of agreement are only as biased as the typical error.
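The factor of 2.77 is 1.96 times root 2, rounded. A sketch of the half-width of the limits of agreement under that definition, with a hypothetical function name:

```python
import math

def limits_of_agreement_half_width(typical_error):
    """Half-width of the limits of agreement: 1.96 * root 2 * typical error."""
    return 1.96 * math.sqrt(2) * typical_error

# Typical error of 0.83 mm, the full-sample skinfold value quoted earlier on this page
print(round(limits_of_agreement_half_width(0.83), 1))  # 2.3
```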

Surprisingly, even the typical error is biased! Yes, the square of the typical error (a variance) is unbiased, so the square root of the variance must be biased low for small samples. In practical terms, typical errors derived from samples of, say, 10 subjects tested twice will look a bit smaller on average than typical errors derived from hundreds of subjects or many retests. This bias in the typical error does not affect any statistical computations involving the typical error.

Spreadsheet for Calculating Reliability
The spreadsheet computes the following measures of reliability between consecutive pairs of trials: change in the mean, typical error, retest correlation (Pearson and intraclass), total error, and limits of agreement. Data in the spreadsheet are from a study of the reliability of the sum of seven skinfolds for a group of athletes.

The spreadsheet now includes averages for the consecutive pairwise estimates of error, with confidence limits. This approach to combining more than two trials is probably more appropriate than the usual analysis of variance or repeated-measures analysis that I describe above (and which, in any case, I can't set up easily on a spreadsheet). I have also included averages of trial means and standard deviations, in case you want to report these as characteristics of your subjects.

Pairwise reliability analyses: Excel spreadsheet

See also the spreadsheet for the ICC, when you have between- and within-subject standard deviations and you want the ICC and its confidence limits, or you have the ICC and you want its confidence limits, or you have an F ratio from an ANOVA and you want the ICC and its confidence limits.

Bartko JJ (1966). The intraclass correlation coefficient as a measure of reliability. Psychological Reports 19, 3-11

Last updated 17 May 09
