Sportscience: Quantitative Research Design

QUANTITATIVE RESEARCH DESIGN

© 1998 Will G Hopkins

Quantitative Research · Types of Design · Samples · Sample Size · What to Measure

SUMMARY

The aim of quantitative research is to determine how one thing (a variable) affects another in a population.
Quantitative research designs are either descriptive (subjects measured once) or experimental (subjects measured before and after a treatment).
A descriptive study establishes only associations between variables. An experiment establishes causality.
A descriptive study usually needs a sample of hundreds or even thousands of subjects for an accurate estimate of the relationship between variables. An experiment, especially a crossover, may need only tens of subjects.
The estimate of the relationship is less likely to be biased if you have a high participation rate in a sample selected randomly from a population. In experiments, bias is also less likely if subjects are randomly assigned to treatments, and if subjects and researchers are blind to the identity of the treatments.
In all studies, measure everything that could account for variation in the outcome variable.
In an experiment, try to measure variables that might explain the mechanism of the treatment. In an unblinded experiment, such variables can help define the magnitude of any placebo effect.

QUANTITATIVE RESEARCH

Quantitative research is all about quantifying the relationships between variables. Variables are the things you measure on your subjects, which can be humans, animals, or cells. Variables can represent subject characteristics (e.g. weight, height, sex), the things you are really interested in (e.g. athletic performance, rate of injury, physiological, psychological or sociological variables), and variables representing the timing of measurements and nature of any treatments subjects receive (e.g. before and after a real drug or a sham drug). To quantify the relationships between these variables, we use values of effect statistics such as the correlation coefficient, the difference between means of something in two groups, or the relative frequency of something in two groups.
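The three effect statistics just mentioned can be sketched in a few lines of Python. This is only an illustration: the function names and the data are made up for the example, and relative frequency is taken here to mean the ratio of the proportions of cases in two groups.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def difference_in_means(group_a, group_b):
    """Mean of group A minus mean of group B."""
    return mean(group_a) - mean(group_b)

def relative_frequency(cases_a, n_a, cases_b, n_b):
    """Proportion of cases in group A divided by the proportion in group B."""
    return (cases_a / n_a) / (cases_b / n_b)

# Made-up data, purely for illustration.
training_hours = [2, 4, 6, 8, 10]
performance = [50, 54, 61, 63, 70]
r = pearson_r(training_hours, performance)

sprint_trained = [12.1, 11.9, 12.3]      # sprint times in seconds
sprint_untrained = [12.6, 12.4, 12.8]
diff = difference_in_means(sprint_trained, sprint_untrained)

# 12 injuries in 100 runners versus 6 injuries in 100 swimmers.
ratio = relative_frequency(12, 100, 6, 100)

print(round(r, 3), round(diff, 2), round(ratio, 2))
```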

TYPES OF DESIGN

Research studies aimed at quantifying relationships are of two kinds: descriptive and experimental. In a descriptive study, no attempt is made to change behavior or conditions--you measure things as they are. In an experimental study you take measurements, try some sort of intervention, then take measurements again to see what happened.

Types of Research Design

Descriptive or observational:
    case
    case series
    cross-sectional
    cohort or prospective or longitudinal
    case-control or retrospective

Experimental or longitudinal or repeated-measures:
    without a control group:
        time series
        crossover
    with a control group

Descriptive Studies

Descriptive studies are also called observational, because you observe the subjects without otherwise intervening. The simplest descriptive study is a case, which reports data on only one subject; examples are studies of an outstanding athlete or of an athlete with an unusual injury. Descriptive studies of a few cases are called case series. In cross-sectional studies variables of interest in a sample of subjects are assayed once and analyzed. In prospective or cohort studies, some variables are assayed at the start of a study (e.g. dietary habits), then after a period of time the outcomes are determined (e.g. incidence of heart disease). Another label for this kind of study is longitudinal, although this term also applies to experiments. Case-control studies compare cases (subjects with a particular attribute, such as an injury or ability) with controls (subjects without the attribute); comparison is made of the exposure to something suspected of causing the cases, for example volume of high intensity training, or number of cigarettes smoked per day. Case-control studies are also called retrospective, because they focus on conditions in the past that might cause subjects to become cases rather than controls.

A common case-control design in the exercise science literature is a comparison of the behavioral, psychological or anthropometric characteristics of elite and sub-elite athletes: you are interested in what the elite athletes have been exposed to that makes them better than the sub-elites. Another type of study compares athletes with sedentary people on some outcome such as an injury, disease, or disease risk factor. Here you know the difference in exposure (training vs no training), so these studies are really cohort or prospective, even though the exposure data are gathered retrospectively at only one time point. They are therefore known as historical cohort studies.

Experimental Studies

Experimental studies are also known as longitudinal or repeated-measures studies, for obvious reasons. They are also referred to as interventions, because you do more than just observe the subjects.

In the simplest experiment, a time series, one or more measurements are taken on all subjects before and after a treatment. A special case of the time series is the so-called single-subject design, in which measurements are taken repeatedly (e.g. 10 times) before and after an intervention on one or a few subjects.

Time series suffer from a major problem: any change you see could be due to something other than the treatment. For example, subjects might do better on the second test because of their experience of the first test, or they might change their diet between tests because of a change in weather, and diet could affect their performance of the test. The crossover design is one solution to this problem. Normally the subjects are given two treatments, one being the real treatment, the other a control or reference treatment. Half the subjects receive the real treatment first, the other half the control first. After a period of time sufficient to allow any treatment effect to wash out, the treatments are crossed over. Any effect of retesting or of anything that happened between the tests can then be subtracted out by an appropriate analysis. Multiple crossover designs involving several treatments are also possible.
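One simple way to "subtract out" the retest effect in a two-treatment crossover is sketched below. It is an assumption about what an appropriate analysis could look like, not necessarily the analysis intended here: the treatment-minus-control difference is averaged within each order group, and the two group means are then averaged so that anything tied to the second test cancels. The function name and the scores are invented.

```python
from statistics import mean

def crossover_effect(treatment_first, control_first):
    """Estimate treatment and period effects in a two-period crossover.

    Each argument is a list of (period1_score, period2_score) tuples,
    one tuple per subject in that order group.
    """
    # Treatment minus control, within each subject.
    d_tf = mean(p1 - p2 for p1, p2 in treatment_first)  # treatment came first
    d_cf = mean(p2 - p1 for p1, p2 in control_first)    # treatment came second
    # A retest (period) effect lowers d_tf and raises d_cf by the same
    # amount, so averaging the two cancels it.
    treatment = (d_tf + d_cf) / 2
    period = (d_cf - d_tf) / 2
    return treatment, period

# Made-up scores built with a true treatment effect of 3 units and a
# retest gain of 2 units on the second test.
treatment_first = [(53, 52), (58, 57), (63, 62)]
control_first = [(50, 55), (55, 60), (60, 65)]

treatment, period = crossover_effect(treatment_first, control_first)
print(treatment, period)
```

With half the subjects in each order, a naive before-and-after comparison in either group alone would be off by the retest gain; the averaged estimate is not.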

If the treatment effect is unlikely to wash out between measurements, a control group has to be used. In these designs, all subjects are measured, but only some of them--the experimental group--then receive the treatment. All subjects are then measured again, and the change in the control group is compared with the change in the experimental group.
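The comparison of changes described above amounts to a difference of differences. A minimal sketch, with a hypothetical function name and made-up scores:

```python
from statistics import mean

def net_effect(exp_pre, exp_post, ctl_pre, ctl_post):
    """Mean change in the experimental group minus mean change in controls."""
    exp_change = mean(post - pre for pre, post in zip(exp_pre, exp_post))
    ctl_change = mean(post - pre for pre, post in zip(ctl_pre, ctl_post))
    return exp_change - ctl_change

# Made-up pre and post scores for three subjects per group.
effect = net_effect(
    exp_pre=[100, 110, 120], exp_post=[106, 115, 127],  # changes 6, 5, 7
    ctl_pre=[102, 108, 118], ctl_post=[103, 110, 121],  # changes 1, 2, 3
)
print(effect)
```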

If the subjects are assigned randomly to experimental and control groups or treatments, the design is known as a randomized controlled trial. Random assignment minimizes the chance that either group is not typical of the population. If the subjects are blind to the identity of the treatment, the design is a single-blind controlled trial. The control or reference treatment in such a study is called a placebo: the name physicians use for inactive pills or treatments that are given to patients in the guise of effective treatments. Blinding of subjects eliminates the placebo effect, whereby people react differently to a treatment if they think it is in some way special. In a double-blind study, the experimenter also does not know which treatment the subjects receive until all measurements are taken. Blinding of the experimenter is important to stop him or her treating subjects in one group differently from those in another. In the best studies even the data are analyzed blind, to prevent conscious or unconscious fudging or prejudiced interpretation.

Ethical considerations or lack of cooperation (compliance) by the subjects sometimes prevent experiments from being performed. For example, a randomized controlled trial of the effects of physical activity on heart disease has yet to be reported, because it is unethical and unrealistic to randomize people to 10 years of exercise or sloth. But there have been many short-term studies of the effects of physical activity on disease risk factors (e.g. blood pressure).

Quality of Designs

The various designs differ in the quality of evidence they provide for a cause-and-effect relationship between variables. Cases and case series are the weakest. A well-designed cross-sectional or case-control study can provide good evidence for the absence of a relationship. But if such a study does reveal a relationship, it generally represents only suggestive evidence of a causal connection. A cross-sectional or case-control study is therefore a good starting point to decide whether it is worth proceeding to better designs. Prospective studies are more difficult and time-consuming to perform, but they produce more convincing conclusions about cause and effect. Experimental studies are definitive about how something affects something else, and with far fewer subjects than descriptive studies! Double-blind randomized controlled trials are the best experiments.

Confounding is a potential problem in descriptive studies that try to establish cause and effect. Confounding occurs when part or all of a significant association between two variables arises through both being causally associated with a third variable. For example, in a population study you could easily show a negative association between habitual activity and most forms of degenerative disease. But older people are less active, and older people are more diseased, so you're bound to find an association between activity and disease without one necessarily causing the other. To get over this problem you have to control for potential confounding factors. For example, you make sure all your subjects are the same age, or you do sophisticated statistical analysis of your data to try to remove the effect of age on the relationship between the other two variables.
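The age example can be simulated to show what controlling for a confounder does. In the sketch below all numbers are invented: activity and disease each depend on age but not on each other, and the "sophisticated statistical analysis" is stood in for by a first-order partial correlation, which is just one of several possible adjustment methods.

```python
import random
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_r(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Simulated population (assumed model): older people are less active and
# more diseased, but activity has no direct effect on disease.
rng = random.Random(1)
age = [rng.uniform(20, 80) for _ in range(5000)]
activity = [100 - a + rng.gauss(0, 10) for a in age]
disease = [a + rng.gauss(0, 10) for a in age]

raw = pearson_r(activity, disease)            # strongly negative
adjusted = partial_r(activity, disease, age)  # close to zero

print(round(raw, 2), round(adjusted, 2))
```

The raw association is strong even though neither variable causes the other in this simulation; it all but vanishes once age is held constant.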

SAMPLES

You almost always have to work with a sample of subjects rather than the full population. But people are interested in the population, not your sample. To generalize from the sample to the population, the sample has to be representative of the population. The safest way to ensure that it is representative is to use a random selection procedure. You can also use a stratified random sampling procedure, to make sure that you have proportional representation of population subgroups (e.g. sexes, races, regions).
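A stratified random sample with proportional representation can be drawn along the following lines. The function name, the population of 600 females and 400 males, and the 10% sampling fraction are all assumptions for the sake of the example.

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, fraction, rng):
    """Randomly sample the same fraction from every subgroup (stratum)."""
    strata = defaultdict(list)
    for person in population:
        strata[stratum_of(person)].append(person)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)     # proportional allocation
        sample.extend(rng.sample(members, k))  # random within the stratum
    return sample

# Made-up population: each person is a (sex, id) pair.
population = [("F", i) for i in range(600)] + [("M", i) for i in range(400)]
sample = stratified_sample(population, lambda p: p[0], 0.1, random.Random(0))

n_female = sum(1 for p in sample if p[0] == "F")
print(len(sample), n_female)
```

A simple random sample of 100 would give 60 females only on average; stratifying fixes the split at the population proportions.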

Selection bias occurs when the sample is not representative of the population. More accurately, a sample statistic is biased if the expected value of the statistic is not equal to the value of the population statistic. (The expected value is the average value from many samples drawn using the same sampling method.) A typical source of bias in population studies is age or socioeconomic status: people with extreme values for these variables tend not to take part in the studies. Thus a high compliance (the proportion of people approached who end up as subjects) is important in avoiding bias. Journal editors are usually happy with compliance rates of at least 70%.

Failure to randomize subjects to control and treatment groups in experiments can also produce bias: if you let people select themselves into the groups, or if you select the groups in any way that makes one group different from another, then any result you get might reflect the group difference rather than an effect of the treatment. For this reason, it's important to randomly assign subjects in a way that ensures the groups are balanced in terms of important variables that could modify the effect of the treatment (e.g. age, gender, physical performance). Randomize subjects to groups as follows: rank-order the subjects on the basis of the variable you most want to keep balanced (e.g. physical performance); split the list up into pairs (or triplets for three treatments, etc.); assign subjects in each pair to the treatments by flipping a coin; check the mean values of your other variables in the two groups, and reassign randomly chosen pairs to balance up these mean values. Human subjects may not be happy about being randomized, so you need to state clearly that it is a condition of taking part.
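The first three steps of the randomization procedure (rank-order, pair up, flip a coin) can be sketched as follows for two treatments. The function name and the subject data are hypothetical, and the final step of checking and rebalancing the other variables is left out of this sketch.

```python
import random

def paired_randomization(subjects, key, rng):
    """Rank subjects on `key`, then split each adjacent pair at random."""
    ranked = sorted(subjects, key=key, reverse=True)
    treatment, control = [], []
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)              # the coin flip
        treatment.append(pair[0])
        control.append(pair[1])
    # With an odd number of subjects, the last one is left unassigned.
    leftover = ranked[-1:] if len(ranked) % 2 else []
    return treatment, control, leftover

# Made-up subjects: (name, physical performance score).
subjects = [("A", 71), ("B", 65), ("C", 80), ("D", 58),
            ("E", 77), ("F", 62), ("G", 69), ("H", 74)]
treatment, control, leftover = paired_randomization(
    subjects, key=lambda s: s[1], rng=random.Random(42))

print(treatment)
print(control)
```

Because the members of each pair are neighbors in the ranking, the two groups end up closely matched on the ranking variable whichever way the coin falls.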

SAMPLE SIZE

How many subjects should you study? You can approach this crucial issue via statistical significance, via confidence intervals, or by using confidence intervals "on the fly".

Via Statistical Significance

Statistical significance is the old-fashioned and complicated approach. Your sample size has to be big enough for you to be sure you will detect the smallest worthwhile effect or relationship between your variables. To be sure means detecting the effect 80% of the time. Detect means getting a statistically significant result, which means you'd expect to see a more extreme result by chance only 5% of the time, if there was no effect at all (in other words, the p value for the effect has to be less than 0.05). Smallest worthwhile effect means, for example, the smallest effect that would make a difference to the lives of your subjects or to your interpretation of whatever you are studying. If you have too few subjects in your study and you get a statistically significant result, most people regard your finding as publishable. But if the result is not significant with a small sample size, it's regarded as unpublishable, because you can't say whether or not there is something going on.
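To give a feel for the numbers, here is a sketch of the usual normal-approximation formula for comparing the means of two independent groups, with the 80% and 5% values above as defaults. The formula and the function name are assumptions brought in for illustration (the text does not give a formula), and the smallest worthwhile effect is expressed as a fraction of the between-subject standard deviation.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate subjects per group needed to detect a standardized
    difference d between two group means (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance at alpha
    z_power = z.inv_cdf(power)          # chance of detecting the effect
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

n_small = n_per_group(0.2)     # smallest worthwhile effect = 0.2 SD
n_moderate = n_per_group(0.5)  # smallest worthwhile effect = 0.5 SD
print(n_small, n_moderate)
```

Halving the smallest worthwhile effect roughly quadruples the required sample, which is why small effects are so expensive to detect.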

Via Confidence Intervals

Using confidence intervals (or limits) is a more enlightened approach to sample-size estimation. You simply want enough subjects to allow you to put acceptable bounds on the estimate of the population value for the effect you are studying. Bounds usually means the 95% confidence limits: the limits within which the true or population value for the effect is likely to fall, where likely means 95% of the time. Acceptable means the upper limit and lower limit have to be so close together that any value of the effect within these limits will make little difference to your subjects or to your interpretation of whatever you are studying.
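A rough version of the same calculation for confidence limits is sketched below, again for a difference between two group means expressed in standard-deviation units. The formula is the standard normal approximation and the function name is made up; the acceptable half-width of the interval is something you have to decide for yourself.

```python
from math import ceil
from statistics import NormalDist

def n_per_group_for_ci(half_width, level=0.95):
    """Approximate subjects per group so that the confidence interval for
    a standardized difference in means is the estimate +/- half_width."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # The standard error of the difference is sqrt(2 / n) in SD units.
    return ceil(2 * (z / half_width) ** 2)

n_ci = n_per_group_for_ci(0.2)  # limits of +/- 0.2 SD around the estimate
print(n_ci)
```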

"On the Fly"

The smaller the sample size, the wider the confidence interval. If the observed effect is close to zero, the confidence interval has to be quite narrow, to exclude the possibility that the true (population) value could be substantially positive or substantially negative. But if the observed effect is large, a wide confidence interval doesn't matter so much. For this reason, I advocate sample size on the fly: start a study with a small sample size, then increase the number of subjects until you get a confidence interval that is appropriate for the magnitude of the effect that you end up with. Simulations show that the resulting magnitudes of effects are not substantially biased.
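The idea can be sketched as a loop that keeps adding subjects until the confidence interval is clear-cut. The stopping rule used here is an assumption made for the example, not a rule given in the text: stop when the interval lies wholly above the smallest worthwhile effect, wholly below its negative, or wholly between the two. The interval uses a crude normal approximation, and the two data streams are deterministic made-up values so the example is reproducible.

```python
from itertools import cycle
from math import sqrt
from statistics import NormalDist, mean, stdev

Z95 = NormalDist().inv_cdf(0.975)

def sample_on_the_fly(draw, smallest, start=10, step=5, max_n=500):
    """Add observations until the 95% interval is clear-cut (or max_n)."""
    data = [draw() for _ in range(start)]
    while True:
        m = mean(data)
        half = Z95 * stdev(data) / sqrt(len(data))  # normal approximation
        lower, upper = m - half, m + half
        clear = (lower > smallest                   # clearly positive
                 or upper < -smallest               # clearly negative
                 or (lower > -smallest and upper < smallest))  # trivial
        if clear or len(data) >= max_n:
            return len(data), lower, upper
        data.extend(draw() for _ in range(step))

# Made-up streams: a large effect (mean 1.0) and a zero effect (mean 0).
big = cycle([0.9, 1.1])
tiny = cycle([-1.0, 1.0])
n_big, lo_big, hi_big = sample_on_the_fly(lambda: next(big), smallest=0.2)
n_tiny, lo_tiny, hi_tiny = sample_on_the_fly(lambda: next(tiny), smallest=0.2)
print(n_big, n_tiny)
```

With these streams the large effect is settled by the first 10 observations, while the zero effect needs 100 before the interval is narrow enough to rule out a substantial value in either direction.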

Effect of Research Design

The type of design you choose for your study has a major impact on the sample size. Descriptive studies need hundreds of subjects to give acceptable confidence intervals (or to ensure statistical significance) for small effects. Controlled trials generally need one-tenth as many, and crossovers need even fewer: one-quarter of the number for an equivalent trial with a control group. I give details on the stats pages at this site.

Effect of Validity and Reliability

The precision with which you measure things also has a major impact on sample size: the worse your measurements, the more subjects you need to lift the signal (the effect) out of the noise (the errors in measurement). Precision is expressed as validity and reliability. Validity represents how well a variable measures what it is supposed to. Validity is important in descriptive studies: if the validity of the main variables is poor, you may need thousands rather than hundreds of subjects. Reliability tells you how reproducible your measures are on a retest, so it impacts on experimental studies. The more reliable a measure, the fewer subjects you need to see a small change in the measure. For example, a controlled trial with 20 subjects in each group or a crossover with 10 subjects may be sufficient to characterize even a small effect, if the measure is highly reliable. See the details on the stats pages.

Pilot Studies

As a student researcher, you might not have enough time or resources to get a sample of optimum size. Your study can nevertheless be a pilot for a larger study. Pilot studies should be done to develop, adapt, or check the feasibility of techniques, or to calculate how big the final sample needs to be. In the latter case, the pilot should be performed with the same sampling procedure and techniques as in the larger study.

For experimental designs, a pilot study can consist of the first 10 or so observations of a larger study. If you get respectable confidence limits, there may be no point in continuing to a larger sample. Publish and move on to the next project or lab!

Meta-Analysis

If you can't test enough subjects to get an acceptably narrow confidence interval, you should still be able to publish your finding, because your study will at least set bounds on how big and how small the effect can be. Your finding can be combined with the findings of similar studies in something called a meta-analysis, which derives a confidence interval for the effect from several studies. If your study is not published, it can't contribute to the meta-analysis! Unfortunately, many reviewers and editors do not appreciate the importance of publishing studies with suboptimal sample sizes. They are still locked into thinking that only statistically significant results are publishable.
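To show how several wide intervals combine into a narrower one, here is a sketch of fixed-effect inverse-variance pooling, one common and simple way of doing a meta-analysis (real meta-analyses often use more elaborate models). The function name and the three studies are invented; each study is given as an estimate with its 95% confidence limits.

```python
from math import sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)

def pooled_estimate(studies):
    """Combine (estimate, lower, upper) triples by inverse-variance weights."""
    weights, weighted = [], []
    for estimate, lower, upper in studies:
        se = (upper - lower) / (2 * Z95)  # standard error from the limits
        w = 1 / se ** 2                   # precise studies count for more
        weights.append(w)
        weighted.append(w * estimate)
    pooled = sum(weighted) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return pooled, pooled - Z95 * pooled_se, pooled + Z95 * pooled_se

# Three made-up small studies, each too imprecise to be conclusive alone.
studies = [(1.2, -0.4, 2.8), (0.8, -0.6, 2.2), (1.5, 0.1, 2.9)]
pooled, lower, upper = pooled_estimate(studies)
print(round(pooled, 2), round(lower, 2), round(upper, 2))
```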

WHAT TO MEASURE

In any study, two groups of variables need to be measured: the characteristics of the subjects, and the independent and dependent variables defining the main research question. For experiments, you can also measure variables related to mechanisms underlying the effect of a treatment.

Characteristics of Subjects

You must report sufficient information about your subjects to identify the population group from which they were drawn. For human subjects, variables such as sex, age, height, weight, socioeconomic status, ethnic origin, training status, competitive status, and current or personal-best performance are common, depending on the focus of the study. Some of these variables might also be part of the research question.

In studies of endurance performance, some journal editors expect you to provide an estimate of maximum oxygen consumption to help characterize your subjects. I oppose this practice, because maximum oxygen consumption is neither particularly reliable nor especially valid as an indicator of the performance level of athletes. Data on current competitive performance, preferably expressed as a percent of world-record performance, are more informative.

Dependent and Independent Variables

Usually you have a good idea of the question you want to answer. That question defines the main variables to measure. For example, if you are interested in enhancing sprint performance, your dependent variable (or outcome variable) is automatically some measure of sprint performance. Cast around for the best way to measure this dependent variable: you want a variable that is as valid as possible (for descriptive studies) or as reliable as possible (for experiments). Sometimes there is more than one dependent variable (e.g. endurance as well as sprint performance).

Next, identify all the things that could affect the dependent variable. These things are the independent variables: training, sex, the treatment in an experimental study, and so on.

For a descriptive study with a wide focus (a "fishing expedition"), your main interest is estimating the effect of everything that is likely to affect the dependent variable, so you include as many independent variables as resources allow. Beware though: the more effects you look for, the more likely the true value of at least one of them lies outside its confidence interval. For a descriptive study with a narrower focus (e.g. the relationship between training and performance), you still measure variables likely to be associated with the outcome variable (e.g. age-group, sex, competitive status), because you either restrict the sample to a particular subgroup defined by these variables (e.g. veteran male elite athletes) or control for the variables statistically.
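The warning about fishing expeditions can be put in numbers. If each 95% confidence interval misses its true value 5% of the time, and if the effects are independent (a simplifying assumption made here), the chance that at least one of k intervals misses is 1 - 0.95^k. The function name is made up.

```python
def prob_at_least_one_miss(k, coverage=0.95):
    """Chance that at least one of k independent intervals misses its
    true value, when each interval has the given coverage."""
    return 1 - coverage ** k

for k in (1, 5, 10, 20):
    print(k, round(prob_at_least_one_miss(k), 2))
```

With ten effects the chance is already about 40%.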

For an experimental study, the main independent variable is the one indicating when the dependent variable is measured (e.g. before, during, and after the treatment). If there is a control group (as in controlled trials) or control treatment (as in crossovers), the identity of the group or treatment is another essential independent variable (e.g. Drug A, Drug B, placebo). Obviously these variables affect the dependent variable, and you automatically include them in any analysis. But there may be other variables that could affect the outcome. For example, the response of males to the treatment might be different from that of females, so once again you either restrict the study to one sex, or you analyze the data to take into account the possibility of a difference. Try to measure other variables that could explain individual differences in the response, such as training status, age, diet, or physiological variables from blood or exercise tests, where there are good reasons for believing such measures could have an effect. The statistical procedures for including such variables are complex, but the payoffs are narrower confidence intervals for the magnitude of the effect and valuable data on who will respond best and worst to the treatment.

Mechanism Variables

With experiments, the main challenge is to determine the magnitude and confidence intervals of the treatment effect. But sometimes you want to know the mechanism of the treatment--how the treatment works or doesn't work. To address this issue, try to find one or more variables that might connect the treatment to the outcome variable, and measure these at the same times as the dependent variable. For example, you might want to determine whether a particular training method enhanced strength by increasing muscle mass, so you might measure limb girths at the same time as the strength tests. When you analyze the data, look for associations between change in limb girth and change in strength. Remember to take into account the extent to which errors of measurement (validity and reliability of the variables) obscure any association.

This kind of approach is effectively a descriptive study on the difference scores of the variables, so it can provide only suggestive evidence for or against a particular mechanism. To understand this point, think about the example of the limb girths and strength: an increase in muscle size does not necessarily cause an increase in strength--other changes that you haven't measured might have done that. To really nail a mechanism, you have to devise another experiment aimed at changing the putative mechanism variable while you control everything else. But that's another research project. Meanwhile, it is sensible to use your current experiment to find suggestive evidence of a mechanism, provided it doesn't entail too much extra work or expense. And if it's research for a PhD, you are expected to measure one or more mechanism variables and discuss intelligently what the data mean.

Finally, a really useful application for mechanism variables: they can define the magnitude of placebo effects in unblinded experiments. In such experiments, there is always the nagging doubt that any treatment effect can be partly or wholly a placebo effect. But if you find a correlation between the change in the dependent variable and change in an objective mechanism variable--one that cannot be affected by the psychological state of the subject--then you can say for sure that the treatment effect is not all placebo. And the stronger the correlation, the smaller the placebo effect. The method works only if there are individual differences in the response to the treatment, because you can't get a correlation if every subject has the same change in the dependent variable. (Keep in mind that some apparent variability in the response between subjects is likely to be random error in the dependent variable, rather than true individual differences in the response to the treatment.)

Surprisingly, the objective variable can be almost anything, provided the subject is unaware of any change in it. In our example of strength training, limb girth is not a good variable to exclude a placebo effect: subjects may have noticed their muscles get bigger, so they may have expected to do better in a strength test. In fact, any noticeable changes could inspire a placebo effect, so any objective variables that correlate with the noticeable change won't be useful to exclude a placebo effect. Think about it. But if the subjects noticed no changes, other than a change in strength, and you found an association between change in blood lipids, say, and change in strength, then the change in strength cannot all be a placebo effect. Unless, of course, changes in blood lipids are related to susceptibility to suggestion...
