Simulation for means on the fly

A New View of Statistics

© 1997 Will G Hopkins

Go to: Previous · Contents · Search · Home

On The Fly Sampling for the Effect-Size Statistic in CROSS-SECTIONAL STUDIES.

SAS program to see how much bias there is when you use acceptable width of confidence interval as the stopping rule.

In summary, the bias is negligible (5% of the confidence interval or less) when sample standard deviations are used. Bias can reach 7-8% of the confidence interval when the population SD is used, but only for very large effects; otherwise bias is 5% of the confidence interval or less for large effects or less. The accuracy of the confidence interval predicted by Becker formulae needs further investigation. I'm not sure how to go about that yet. At this stage all I did was compare the Becker predictions with the confidence limits from the sampling distribution. The lower limit agreed nicely, but the upper limit was out when there was bias, as you would expect. These findings apply also to the method for longitudinal studies with or without a control group.

This page is long and complex. The main simulation is further down. But first, I had to produce the data shown in the graph of sample size vs effect size. To do that, I used the special cases of the ES sitting in the middle of each step of the magnitude scale, and found the sample sizes that gave confidence limits falling on the boundaries of the step. Then I just fit a smooth curve to the points on the graph using the curve fit in Excel (not a polynomial).

Here's how I did those special cases.
 
For effect sizes, the boundaries of each step of the magnitude scale 
corresponding to the correlations that define the steps are as follows
(derived from ES=2sqrt(r**2/(1 - r**2)):
     trivial-small--medium-large-v.large
 r: 0.00  0.10   0.30   0.50   0.70   0.90
ES: 0.00  0.20   0.63   1.16   1.96   4.13
 
I then used simulation to get the N for each ES in the "middle" of each 
step. The middles when you use a sample SD are offset from the exact middle 
for larger ES.  They correspond to the correlations in the middle. The 
middles when you use the population SD are exactly in the middle of the ES 
range.
 
By trial and error, I used the program below to find the values of N 
that give a confidence interval just spanning the step.  Here are the values 
of ES and N:
 
Using sample SD:
ES: 0.00  0.41  0.87  1.53  2.87
N:  380   350   270   128   32
 
Using population SD
(the ESs sit symmetrically in the middle of the step):
ES: 0.00  0.41  0.89  1.56  3.04
N:  380   332   222   94    14
(The values of N for a given ES for a population SD could also have been
derived from first principles using: conf int = 2*t*sqrt(4/N),where t is 
the value of the t statistic for 2N-2 degrees of freedom and cumulative
probability of 0.975.)
 
OK, here's the simulation that gave the above data.  I had to tweak
the values of startn until I got it right for each step.
 
 
options linesize=80;
options pagesize=30;
 
%macro start;
data dat1;
do trial=1 to &ntrials;
  do id=1 to &startn/2;
    y1=rannor(0);
    y2=rannor(0)+&es;
    output;
   end;
  end;
drop id;
 
%mend;
 
%let ntrials=1000;
%let es=2.87;
%let startn=32;
%start;
 
proc means noprint data=dat1;
var y1 y2;
output out=dat n=n mean= var=vary1 vary2;
by trial;
 
data dat2;
set;
es_popsd=y2-y1; *sample es using population SD (=1 here);
samplees=(y2-y1)/sqrt((vary1+vary2)/2); *sample es;
crrctes=samplees*(1-3/(4*2*n-1)); *corrected sample es;
sdes=sqrt(2/n+crrctes**2/2/(n-1)); *Becker formula;
confint=2*1.96*sdes; *approx confidence interval;
cipred = 6.163550E-3*crrctes**4 + 1.547127E-1*crrctes**2 + 4.029553E-1;
confliml=sqrt(2/n)*tinv(0.025,2*n-2,crrctes*sqrt(n/2));
conflimu=sqrt(2/n)*tinv(0.975,2*n-2,crrctes*sqrt(n/2));
/*
The above are confidence limits predicted by Becker formula, with icc=0:
conflimu=sqrt(2*(1-icc)/n)*tinv(0.975,n-1,sqrt(n/(2*(1-icc)))*crrctes)
(included as an extra check on everything!).
*/
/*
*include this when checking n vs ES for population SD;
proc univariate noprint;
var es_popsd;
output mean=mean pctlpre=Q pctlpts=2.5 50 97.5;
 
proc print noobs;
format _numeric_ 5.2;
title "Sampling distn for es with known popn sd in x-sect studies, es=&es n=&startn trials=&ntrials";
 
run;
*/
 
proc univariate noprint;
var samplees;
output mean=mean pctlpre=Q pctlpts=2.5 50 97.5;
 
proc print noobs;
format _numeric_ 5.2;
title "Sampling distn for uncorrected es in x-sect studies, es=&es n=&startn trials=&ntrials";
 
proc univariate noprint data=dat2;
var crrctes;
output mean=mean pctlpre=Q pctlpts=2.5 50 97.5;
 
proc print noobs;
format _numeric_ 5.2;
title "Sampling distn for unbiased es in x-sect studies, es=&es n=&startn trials=&ntrials";
 
proc means mean std min max maxdec=2 data=dat2;
var samplees crrctes confint cipred confliml conflimu;
title "Stats for es in x-sectional studies, es=&es n=&startn trials=&ntrials";
 
run;
 
 
 
Now, the simulation to check for bias.
 
I intended originally to derive equations for the curves in the N vs ES 
figure, then use them to predict the necessary sample size for a given ES, 
then continue sampling iteratively until the sample size was more than the 
predicted sample size.  Unfortunately polynomials do not give satisfactory 
curve fits to the data for N vs ES in either case (sample or popn SD). I 
therefore used width of confidence intervals and the "1 over root n" 
strategy, as decribed for longitudinal studies.
 
I derived an equation for the acceptable width of confidence interval for a 
given ES simply from the boundaries of the steps in the magnitude scale, as 
described in the simulation for effect size in longitudinal studies.
 
I then used this equation to predict the desired conf int (cipred)
for a given sample ES. I divided that into the actual confidence interval, 
squared the result, and multiplied by the current sample size to get the 
next sample size.  Except that I used N-1 rather than N in this proportional 
calculation, because for large ES the rule is more like "1 over root(N-1)", 
and for small ES, N is so large it doesn't matter if you use N-1.
 
I calculated the actual confidence interval for the given ES using an 
approximation.  The true value of the confidence interval is given by the 
non-central t statistic, as shown in the output step of the simulation.  I 
used confint=2*1.96*sdes, where sdes=sqrt(2/n+crrctes**2/2/(n-1)), where 
crrctes is the corrected (unbiased) estimate of the effect size.  I used 
this approximation because I wanted to put it on an Excel spreadsheet, but 
Excel doesn't currently offer non-central t stats.  The approximation is 
biased high for large ES, so I used the uncorrected ES in the prediction 
formula to help offset the bias. 
 
 
Here's the simulation when the sample SD is used.  Below it is the 
simulation for population SD.
 
 
%macro start;
data dat1;
do trial=1 to &ntrials;
  do id=1 to &startn/2;
    y1=rannor(0);
    y2=rannor(0)+&es;
    output;
   end;
  end;
drop id;
 
%mend;
 
%let ntrials=500;
%let es=1.56;
%let startn=40; *initial size (=startn/2 in each group);
%let nmax1=100; *total size limit for 1st interation;
%let nmax2=200; *total size limit for 2nd interation;
%start;
 
data dat0;
set;
dataset="initial";
 
proc means noprint data=dat1;
var y1 y2;
output out=dat n=n mean= var=vary1 vary2;
by trial;
 
data dat2;
set;
iter=1;
samplees=(y2-y1)/sqrt((vary1+vary2)/2);
crrctes=samplees*(1-3/(4*2*n-1));
sdes=sqrt(2/n+crrctes**2/2/(n-1));
confint=2*1.96*sdes;
*cipred = 6.163550E-3*crrctes**4 + 1.547127E-1*crrctes**2 + 4.029553E-1;
cipred = 6.163550E-3*samplees**4 + 1.547127E-1*samplees**2 + 4.029553E-1;
if cipred<confint then do;
  nnew=round((n-1)*(confint/cipred)**2)-n+1;
  if nnew then do;
    if nnew+n>&nmax1/2 then nnew=&nmax1/2-n;
    do id=n+1 to nnew+n;
      y1=rannor(0);
      y2=rannor(0)+&es;
      output;
      end;
    end;
  end;
keep trial y1 y2 nnew iter;
 
 
*2nd iteration;
 
data dat1;
set dat1 dat2;
 
proc sort;
by trial;
 
proc means noprint data=dat1;
var y1 y2;
output out=dat n=n mean= var=vary1 vary2;
by trial;
 
data dat2;
set;
iter=2;
samplees=(y2-y1)/sqrt((vary1+vary2)/2);
crrctes=samplees*(1-3/(4*2*n-1));
sdes=sqrt(2/n+crrctes**2/2/(n-1));
confint=2*1.96*sdes;
*cipred = 6.163550E-3*crrctes**4 + 1.547127E-1*crrctes**2 + 4.029553E-1;
cipred = 6.163550E-3*samplees**4 + 1.547127E-1*samplees**2 + 4.029553E-1;
if cipred<confint then do;
  nnew=round((n-1)*(confint/cipred)**2)-n+1;
  if nnew then do;
    if nnew+n>&nmax2/2 then nnew=&nmax2/2-n;
    do id=n+1 to nnew+n;
      y1=rannor(0);
      y2=rannor(0)+&es;
      output;
      end;
    end;
  end;
keep trial y1 y2 nnew iter;
 
*3nd iteration;
 
data dat1;
set dat1 dat2;
 
proc sort;
by trial;
 
proc means noprint data=dat1;
var y1 y2;
output out=dat n=n mean= var=vary1 vary2;
by trial;
 
data dat2;
set;
iter=3;
samplees=(y2-y1)/sqrt((vary1+vary2)/2);
crrctes=samplees*(1-3/(4*2*n-1));
sdes=sqrt(2/n+crrctes**2/2/(n-1));
confint=2*1.96*sdes;
*cipred = 6.163550E-3*crrctes**4 + 1.547127E-1*crrctes**2 + 4.029553E-1;
cipred = 6.163550E-3*samplees**4 + 1.547127E-1*samplees**2 + 4.029553E-1;
if cipred<confint then do;
  nnew=round((n-1)*(confint/cipred)**2)-n+1;
  if nnew then do;
    do id=n+1 to nnew+n;
      y1=rannor(0);
      y2=rannor(0)+&es;
      output;
      end;
    end;
  end;
keep trial y1 y2 nnew iter;
 
 
 
*output results;
 
data dat1;
set dat1 dat2;
dataset="final";
 
data datboth;
set dat0 dat1;
 
proc sort;
by dataset trial;
 
proc means noprint data=datboth;
var y1 y2;
output out=dat n=n mean= var=vary1 vary2;
by dataset trial;
 
data dat2;
set;
samplees=(y2-y1)/sqrt((vary1+vary2)/2);
crrctes=samplees*(1-3/(4*2*n-1));
sdes=sqrt(2/n+crrctes**2/2/(n-1));
confint=2*tinv(0.975,2*n-2)*sdes;
confliml=sqrt(2/n)*tinv(0.025,2*n-2,crrctes*sqrt(n/2));
conflimu=sqrt(2/n)*tinv(0.975,2*n-2,crrctes*sqrt(n/2));
 
proc means noprint;
var n crrctes confint confliml conflimu;
by dataset;
output mean=;
 
proc print noobs;
format _numeric_ 5.2 n 4.;
var dataset n crrctes confint confliml conflimu;
title "ES stats for es=&es startn=&startn";
title2 "N is the final number in EACH group";
 
data datfinal;
set dat2;
if dataset="final";
 
proc univariate noprint;
var crrctes;
output mean=mean pctlpre=Q pctlpts=2.5 50 97.5;
 
proc print noobs;
format _numeric_ 5.2;
title "Sampling distn for es, es=&es n=&startn";
title2 "for corrected final sample effect size";
 
proc means mean std min max maxdec=0 data=datfinal;
var n;
title "Stats for final sample size  es=&es n=&startn";
title2 "N is the final number in EACH group";
 
 
proc sort data=dat1;
by trial iter;
 
proc means noprint;
var nnew;
output mean=;
by trial iter;
 
proc sort;
by iter;
 
proc means noprint;
var nnew;
by iter;
output n=n mean= std=std min=min max=max;
 
data;
set;
if iter;
 
proc print noobs;
var iter n nnew std min max;
format _numeric_ 5.0;
title "Number of extra observations at each iteration in EACH group";
title2 "for es=&es total startn=&startn total nmax1=&nmax1 total nmax2=&nmax2";
 
run;
 
 
*****************;
/*
Simulation using population SD.
 
The prediction formula is slightly different for large ES in this case,
because the ES sits in the middle of the interval:
cipred = 3.575348E-3*samplees**4 + 1.565327E-1*samplees**2 + 4.015905E-1;
 
Width of confidence interval is also now strictly proportional to 1/root(n);
*/
 
%macro start;
data dat1;
do trial=1 to &ntrials;
  do id=1 to &startn/2;
    y1=rannor(0);
    y2=rannor(0)+&es;
    output;
   end;
  end;
drop id;
 
%mend;
 
%let ntrials=500;
%let es=1.54;
%let startn=40; *initial size (=startn/2 in each group);
%let nmax1=100; *total size limit for 1st interation;
%let nmax2=200; *total size limit for 2nd interation;
%start;
 
data dat0;
set;
dataset="initial";
 
proc means noprint data=dat1;
var y1 y2;
output out=dat n=n mean= var=vary1 vary2;
by trial;
 
data dat2;
set;
iter=1;
samplees=(y2-y1);
sdes=sqrt(2/n);
confint=2*1.96*sdes;
cipred = 3.575348E-3*samplees**4 + 1.565327E-1*samplees**2 + 4.015905E-1;
if cipred<confint then do;
  nnew=round(n*(confint/cipred)**2)-n;
  if nnew then do;
    if nnew+n>&nmax1/2 then nnew=&nmax1/2-n;
    do id=n+1 to nnew+n;
      y1=rannor(0);
      y2=rannor(0)+&es;
      output;
      end;
    end;
  end;
keep trial y1 y2 nnew iter;
 
 
*2nd iteration;
 
data dat1;
set dat1 dat2;
 
proc sort;
by trial;
 
proc means noprint data=dat1;
var y1 y2;
output out=dat n=n mean= var=vary1 vary2;
by trial;
 
data dat2;
set;
iter=2;
samplees=(y2-y1);
sdes=sqrt(2/n);
confint=2*1.96*sdes;
cipred = 3.575348E-3*samplees**4 + 1.565327E-1*samplees**2 + 4.015905E-1;
if cipred<confint then do;
  nnew=round(n*(confint/cipred)**2)-n;
  if nnew then do;
    if nnew+n>&nmax2/2 then nnew=&nmax2/2-n;
    do id=n+1 to nnew+n;
      y1=rannor(0);
      y2=rannor(0)+&es;
      output;
      end;
    end;
  end;
keep trial y1 y2 nnew iter;
 
*3nd iteration;
 
data dat1;
set dat1 dat2;
 
proc sort;
by trial;
 
proc means noprint data=dat1;
var y1 y2;
output out=dat n=n mean= var=vary1 vary2;
by trial;
 
data dat2;
set;
iter=3;
samplees=(y2-y1);
sdes=sqrt(2/n);
confint=2*1.96*sdes;
cipred = 3.575348E-3*samplees**4 + 1.565327E-1*samplees**2 + 4.015905E-1;
if cipred<confint then do;
  nnew=round(n*(confint/cipred)**2)-n;
  if nnew then do;
    do id=n+1 to nnew+n;
      y1=rannor(0);
      y2=rannor(0)+&es;
      output;
      end;
    end;
  end;
keep trial y1 y2 nnew iter;
 
 
 
*output results;
 
data dat1;
set dat1 dat2;
dataset="final";
 
data datboth;
set dat0 dat1;
 
proc sort;
by dataset trial;
 
proc means noprint data=datboth;
var y1 y2;
output out=dat n=n mean= var=vary1 vary2;
by dataset trial;
 
data dat;
set;
samplees=(y2-y1);
sdes=sqrt(2/n);
confint=2*1.96*sdes; *normal distn, not t, when use population SD;
confliml=samplees-confint/2;
conflimu=samplees+confint/2;
 
proc means noprint;
var n samplees confint confliml conflimu;
by dataset;
output mean=;
 
proc print noobs;
format _numeric_ 5.2 n 4.;
var dataset n samplees confint confliml conflimu;
title "ES stats es=&es startn=&startn";
title2 "N is the final number in EACH group";
title3 "for ES calculated from population SD";
 
data datfinal;
set dat;
if dataset="final";
 
proc univariate noprint;
var samplees;
output mean=mean pctlpre=Q pctlpts=2.5 50 97.5;
 
proc print noobs;
format _numeric_ 5.2;
title "Sampling distn for es, es=&es n=&startn";
title2 "for final sample effect size";
 
proc means mean std min max maxdec=0 data=datfinal;
var n;
title "Stats for final sample size  es=&es n=&startn";
title2 "N is the final number in EACH group";
 
 
proc sort data=dat1;
by trial iter;
 
proc means noprint;
var nnew;
output mean=;
by trial iter;
 
proc sort;
by iter;
 
proc means noprint;
var nnew;
by iter;
output n=n mean= std=std min=min max=max;
 
data;
set;
if iter;
 
proc print noobs;
var iter n nnew std min max;
format _numeric_ 5.0;
title "Number of extra observations at each iteration in EACH group";
title2 "for es=&es total startn=&startn total nmax1=&nmax1 total nmax2=&nmax2";
 
run;

Go to: Previous · Contents · Search · Home

A New View of Statistics	© 1997 Will G Hopkins
Go to: Previous · Contents · Search · Home